
Automated Clinical Coding
for Outpatient Departments

Viktor Schlegel    Abhinav Ramesh Kashyap    Thanh-Tung Nguyen    Tsung-Han Yang    Vijay Prakash Dwivedi    Wei-Hsian Yin    Jeng Wei    Stefan Winkler, Fellow, IEEE

Submitted for review to “IEEE Journal of Biomedical and Health Informatics” on dd.mm.yyyy. Corresponding author: V. Schlegel ([email protected]). V. Schlegel, A. R. Kashyap, T.-T. Nguyen, T.-H. Yang, V. P. Dwivedi, and S. Winkler are with ASUS Intelligent Cloud Services (AICS), Singapore. W.-H. Yin and J. Wei are with Cheng Hsin General Hospital, Taipei. S. Winkler is also with the National University of Singapore (NUS).
Abstract

Computerised clinical coding approaches aim to automate the process of assigning a set of codes to medical records. While there is active research pushing the state of the art on clinical coding for hospitalised patients, the outpatient setting—where doctors tend to non-hospitalised patients—is overlooked. Although both settings can be formalised as a multi-label classification task, they present unique and distinct challenges, which raises the question of whether the success of inpatient clinical coding approaches translates to the outpatient setting. This paper is the first to investigate how well state-of-the-art deep learning-based clinical coding approaches work in the outpatient setting at hospital scale. To this end, we collect a large outpatient dataset comprising over 7 million notes documenting over half a million patients. We adapt four state-of-the-art clinical coding approaches to this setting and evaluate their potential to assist coders. We find evidence that clinical coding in outpatient settings can benefit from innovations developed on popular inpatient coding benchmarks. A deeper analysis of the factors contributing to this success—amount and form of data and choice of document representation—reveals the presence of easy-to-solve examples, the coding of which can be completely automated with a low error rate.

Index Terms

Health information management, Hospitals, Deep Learning, Multilabel Classification

1 Introduction

Medical records are primary sources of documentation of patient care, disease progression, and healthcare operations. To make these potentially unstructured records findable, accessible, and interoperable, they are codified by clinical coders according to a standardised vocabulary, such as the International Classification of Diseases (ICD) ontology [1]—a hierarchically arranged vocabulary of standardised codes describing medical conditions, symptoms, diagnoses and hospital procedures. These codes, in turn, are used to claim reimbursement from medical insurance, to optimise resource allocations, or as a basis to select participants for clinical trials. Being a fundamental building block for these operations, it is important to maximise the accuracy and efficiency of the coding process, which has given rise to computer-assisted clinical coding tools [2]. Research in this direction promises to improve both the speed of clinical coding and the quality of resulting codes, easing the burden of clinicians and coders alike.

Despite steady progress in clinical coding of inpatient discharge summaries [3, 4, 5, 6], there has been a lack of attention to other coding settings from the NLP community. One such example is the clinical coding of medical records that document patients’ ambulatory or outpatient hospital visits. While the setting appears closely related to the inpatient setting, formalised using the same machine-learning task of multi-label document classification, clinical coding in the outpatient setting poses distinct challenges which call for a deeper investigation.

The first difference is the underlying document type and its purpose. While discharge summaries describe a patient’s course during their stay in a hospital, outpatient notes (similarly to inpatient progress notes) document a single doctor-patient encounter. Discharge summaries tend to be long and self-sufficient documents [7]. Conversely, outpatient notes are much shorter; a collection of notes might document the progression of an ongoing condition. Consequently, outpatient notes often contain redundant information [8]. For example, when a patient visits a hospital to renew their medication without any change to the underlying condition, the doctor might copy different parts of existing documentation into the new note, including the note’s text as well as the ICD codes.

Figure 1: Overview of our proposed OPD-Reranker architecture which is optimised to re-rank the predictions of an optimised base model, taking into account available structured and unstructured additional (meta-)information in addition to the text contained in an outpatient note.

Secondly, discharge summaries are typically written by doctors and later handed over to medical coders, i.e., different staff with different responsibilities [9]. As a result, discharge summaries must be self-contained, as they are the only way to convey information between summary author and coder [10]. This is not necessarily the case in outpatient settings, where doctors both document the visit and assign appropriate codes [11]. Therefore, information required to correctly codify a note might be omitted due to time constraints, as doctors might favour operational efficiency over the completeness of documentation [12]. For example, a doctor might omit important details in the note (e.g., whether a patient experienced pain in their left or right leg), and instead simply select the appropriate ICD code to provide additional context (e.g., “M79.662: Pain in left lower leg”). This presents a major challenge to automated coding approaches because succeeding in the task requires the ability to infer such details, e.g., based on the prevalence of certain codes, by identifying distinct writing styles of doctors and learning their coding preferences [13] or by relying on additional available information, such as medications.

In this paper, we set out to investigate how well state-of-the-art automated clinical coding approaches are equipped to address the outpatient coding task. To the best of our knowledge, this is the first study to investigate the feasibility of predicting “billable” (i.e., directly usable for reimbursement purposes) ICD10 codes. The investigation is carried out on a large-scale dataset comprised of more than seven million clinical notes describing outpatient encounters of more than 550k patients, contributed by more than 200 doctors from over 50 outpatient departments. Despite the differences outlined above, we find that advances in state-of-the-art inpatient clinical coding largely translate to the outpatient setting. We propose a flexible architecture to further improve performance by exploiting available structured and unstructured additional information. We further show that simple, data-oriented adaptations drastically reduce training time and improve training stability. Finally, we present a method to exploit model confidence on easy-to-classify examples to “automate” these with minimal false-positive rate. We conclude by making a set of recommendations for researchers and practitioners to support them in similar endeavours.

Table 1: Statistics of the raw and processed OPD (“outpatient departments”) datasets in comparison with the most similar inpatient dataset, MIMIC-IV-ICD10.
                                     OPD-raw                       OPD-dedup                     MIMIC-IV-ICD10
                                     Train     Dev     Test        Train     Dev     Test        Train     Dev     Test
Document Type                        Outpatient Notes              Outpatient Notes              Discharge Summaries
Language                             English and Chinese           English and Chinese           English
Codes entered by                     Doctor                        Doctor                        Clinical Coder
Document Author is its Coder         Yes                           Yes                           No
Code type                            Diagnoses                     Diagnoses                     Diagnoses and Procedures
Number of Documents                  7,463K    13,282  13,274      2,381K    4,323   4,378       110,442   4,017   7,851
Number of Patients                   554,917   1,000   1,108       554,917   1,000   1,108       59,114    2,189   4,380
Number of Distinct Codes             18,701    1,492   1,494       2,588     1,492   1,494       25,230    6,738   9,159
Mean Document Length (characters)    712       758     679         396       380     368         10,146    10,215  10,022
Mean # Codes per Document            2.85      2.86    2.78        2.39      2.40    2.30        16.1      16.2    15.8
% Distinct Codes Unseen in Train     -         0.7%    0.5%        -         15.8%   15.6%       -         13.3%   6.4%

2 Clinical Coding in Outpatient Settings

In this section, we formulate the task of clinical coding and describe the corpus of outpatient notes and evaluation metrics used in the study.

2.1 Task Formulation and Model Architectures

Clinical coding is formulated as document-level multi-label classification, where $n$ out of $N$ possible labels need to be assigned to an input document $\mathbf{X}$. Typically, $N$ is much larger than $n$, with an unbalanced label distribution [14, 15].

To overcome these challenges associated with clinical coding, most deep learning-based techniques utilise two key elements: The first is a document encoder, a neural network that combines the hidden representations of individual tokens to create a representation $\mathbf{H}$ of the input document $\mathbf{X}$. The second component is a label attention mechanism which is used to obtain the label-specific document representation $\mathbf{V}$. These label-specific representations serve as input to the label classification layer to obtain the probability vector $\mathbf{P}$ over all labels. The entire architecture is trained end-to-end by minimising the binary cross-entropy loss between ground truth and predicted label probabilities. During inference, labels are selected as predictions based on a decision rule, such as the probability of a label being above a pre-defined threshold.
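A minimal PyTorch sketch of this generic recipe is shown below; the encoder, dimensions and initialisation are placeholders rather than the configuration of any specific model discussed in this section.

```python
import torch
import torch.nn as nn

class LabelAttentionClassifier(nn.Module):
    """Generic document encoder + label attention + per-label classification sketch."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder  # any module mapping token ids to (batch, seq_len, hidden_dim)
        self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_dim) * 0.02)
        self.output_weights = nn.Parameter(torch.randn(num_labels, hidden_dim) * 0.02)
        self.output_bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, x):
        H = self.encoder(x)                                           # (B, T, d) token representations
        scores = torch.einsum("ld,btd->blt", self.label_queries, H)
        attn = torch.softmax(scores, dim=-1)                          # one attention distribution per label
        V = torch.einsum("blt,btd->bld", attn, H)                     # label-specific document representations
        logits = (V * self.output_weights).sum(-1) + self.output_bias # (B, num_labels)
        return logits

# Training minimises binary cross-entropy over all labels:
#   loss = nn.BCEWithLogitsLoss()(logits, multi_hot_targets.float())
# At inference, a label is predicted when sigmoid(logit) exceeds a chosen threshold.
```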

In our study, we focus on the following four approaches, listed in ascending order of their performance on MIMIC-IV-ICD10 [4], as well as our proposed re-ranker:

2.1.1 CAML: Convolutional Attention Network for MultiLabel Classification

The CAML architecture [3] was the first to employ the label-attention mechanism to obtain label-specific document representations for each ICD code. It uses a single-layer CNN as the document encoder.

2.1.2 LAAT: Label Attention Model

The LAAT model uses an LSTM network as document encoder [16]. It improves the label attention layer with a structured self-attention mechanism that makes use of the hierarchy of the ICD ontology.

2.1.3 MSMN: Multiple Synonyms Matching Network

The MSMN architecture [17] enhances label attention with ICD code synonyms derived from the UMLS meta-thesaurus [18]. This is enabled by a multi-head synonym mechanism, which attends to synonyms to enhance the learned representation of the label embeddings.

2.1.4 OPD-LM-LAAT

This method employs pre-trained language models (PLM) [19] as document encoders and the label attention layer of LAAT [16]. In our adaptation, we investigate the application of a hospital-specialised language model.

2.1.5 OPD-Reranker

Since textual records of outpatient encounters alone can be incomplete, we develop a simple architecture that takes into account available structured (e.g., lab results) and unstructured (e.g., imaging reports) information to re-rank the predictions of another (base) model (Figure 1). Learnable embeddings $\mathbf{e}_m$ represent entries of each structured modality $m$, and a document encoder produces the note representation $\mathbf{H}$ as well as the representation $\mathbf{H}'$ of the unstructured additional information. These are used to obtain the final label scores in the following way:

$$\begin{aligned}
\mathbf{E}_{L'} &= \mathbf{E}_L \oplus \sum\nolimits_{m\in\mathcal{M}} \mathbf{e}_m \\
\mathbf{E}_{L''} &= Attn_N(\mathbf{E}_{L'}, \mathbf{H}, \mathbf{H}) + Attn_M(\mathbf{E}_{L'}, \mathbf{H}', \mathbf{H}') \\
\mathbf{P}' &= W_P \mathbf{E}_{L''} + b_P \\
\mathbf{P}_f &= \mathbf{P}' + \mathbf{P}
\end{aligned}$$

where $\mathbf{P}$ is the base model prediction, $\mathbf{E}_L$ are the ICD-label embeddings, $\mathcal{M}$ is the set of all available structured information modalities, $W_P$ and $b_P$ are the learnable weight matrix and bias vector of the final projection layer, respectively, and $Attn(Q,K,V)$ calculates the multi-head attention with query $Q$, key $K$ and value $V$ (and the corresponding learnable weights $W_Q$, $W_K$ and $W_V$ for each head and $W_O$ to project the concatenation of each head's outputs). More specifically, $Attn_N$ attends to each token of the hidden note representation $\mathbf{H}$ and $Attn_M$ attends to each token of the representation of unstructured additional information $\mathbf{H}'$. $\oplus$ denotes label-wise addition, where the vector representing the sum of all modalities' $\mathbf{e}_m$ is added to each label embedding, i.e., to each row of the matrix $\mathbf{E}_L$.
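To make these equations concrete, the following is a minimal PyTorch sketch of the re-ranking step; the module layout, the mean-pooling of multiple entries per structured modality and all names are our simplifications, not the exact implementation.

```python
import torch
import torch.nn as nn

class OPDReranker(nn.Module):
    """Sketch of the re-ranker: adjusts frozen base-model label scores using extra information."""

    def __init__(self, num_labels: int, dim: int, modality_sizes: dict, num_heads: int = 2):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, dim)                            # E_L
        # one embedding table per structured modality (e.g. medication, procedure, doctor, department)
        self.modality_emb = nn.ModuleDict(
            {m: nn.Embedding(size, dim) for m, size in modality_sizes.items()}
        )
        self.attn_note = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # Attn_N
        self.attn_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # Attn_M
        self.proj = nn.Linear(dim, 1)                                             # W_P, b_P

    def forward(self, base_scores, H, H_meta, structured):
        # base_scores: (B, L) scores P of the frozen base model
        # H: (B, T, d) note representation; H_meta: (B, T', d) unstructured additional information
        # structured: dict mapping modality name -> (B, n_m) index tensor (e.g. prescribed drugs)
        B, L = base_scores.shape
        E_L = self.label_emb.weight.unsqueeze(0).expand(B, L, -1)                 # (B, L, d)
        # sum of (mean-pooled) modality embeddings, added to every label embedding (the ⊕ step)
        meta = sum(self.modality_emb[m](idx).mean(dim=1) for m, idx in structured.items())
        E_L_prime = E_L + meta.unsqueeze(1)
        attended_note, _ = self.attn_note(E_L_prime, H, H)
        attended_meta, _ = self.attn_meta(E_L_prime, H_meta, H_meta)
        E_L_double_prime = attended_note + attended_meta
        delta = self.proj(E_L_double_prime).squeeze(-1)                           # P'  (B, L)
        return base_scores + delta                                                # P_f = P' + P
```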

2.2 Evaluation Metrics

Adoption of automated clinical coding tools can be challenging specifically in outpatient scenarios, as coders might feel “hostile” towards automated tools which could potentially replace them [20]. Furthermore, state-of-the-art results on inpatient benchmarks [21, 4] have shown that the performance of existing automated clinical coding tools renders them unsuitable as replacement for human coders.

Therefore, in addition to the usual metrics employed for evaluation of multi-label classification problems, i.e., AUC and F1 (both micro- and macro-averaged), we also use metrics that emphasise the assistance aspects of clinical coding [22]. Specifically, we measure Recall@$k$, i.e., how many correct labels out of all correct ones are among the top $k$ scored predictions when ranked by their predicted probability. This allows us to approximate the performance in the scenario where a clinical coding system recommends a set of labels to a coder, leaving the final decision to a human. We set $k$ to 5, based on insights about the working memory capacity of humans [23]. Additionally, Recall@$k$ is threshold-free; therefore, no decision rule is needed to convert from label probabilities to the predicted label set. This allows us to compare the predictions directly, without being influenced by their decision rules.

In addition to the recall-based metric, to understand the performance on each evaluation sample, we calculate an instance-averaged iF1 score: the harmonic mean between precision and recall is computed for each instance, and the mean is taken across all instances. For cases with an iF1 of 1, the predictions are an exact match with the ground-truth labels.
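Both metrics can be computed from binary label matrices and predicted probabilities as in the following sketch (NumPy; variable names are illustrative).

```python
import numpy as np

def recall_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int = 5) -> float:
    """Mean fraction of gold labels found among the k highest-scored labels per instance."""
    top_k = np.argsort(-y_score, axis=1)[:, :k]                  # indices of the k highest-scored labels
    hits = np.take_along_axis(y_true, top_k, axis=1).sum(axis=1)
    return float(np.mean(hits / np.maximum(y_true.sum(axis=1), 1)))

def instance_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Instance-averaged F1: harmonic mean of precision and recall per note, then averaged."""
    tp = (y_true * y_pred).sum(axis=1)
    precision = tp / np.maximum(y_pred.sum(axis=1), 1)
    recall = tp / np.maximum(y_true.sum(axis=1), 1)
    f1 = np.where(precision + recall > 0, 2 * precision * recall / (precision + recall), 0.0)
    return float(np.mean(f1))

# y_true, y_pred: binary (num_instances, num_labels) matrices; y_score: predicted probabilities.
```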

2.3 Corpus of Outpatient Notes

To build a dataset to optimise data-driven clinical coding approaches, we collected outpatient encounter data from the Cheng Hsin General Hospital in Taipei, Taiwan, after obtaining approval from their institutional review board (number (774)109A-14). This dataset consists of over 7 million outpatient notes recording visits of over 550,000 patients, codified with their corresponding ICD10 labels. The statistics of the dataset are described in Table 1.

The input documents are shorter than their inpatient counterparts, fewer codes are assigned per note on average, and procedure codes are excluded from the documentation, which might suggest that the task is considerably easier. However, for outpatient encounters in Taiwan hospitals, attending doctors codify the documents [11], while coders spot-check these records a posteriori. This may lead to the omission of textual information, as described earlier. Another challenge is bilinguality—while the primary documentation language is English, doctors may write parts of the note in (traditional) Chinese.

Similar to MIMIC-IV-ICD10, codes are unevenly distributed, such that 50% of the ICD10 codes are associated with five or fewer training instances (five for MIMIC-IV). Furthermore, the ratio of medical records to patients is much higher than in comparable inpatient coding datasets (e.g., two discharge summaries on average per patient for MIMIC-IV). This is relevant, because unlike discharge summaries, which summarise a whole hospital course, outpatient notes keep a record of each doctor-patient encounter. Many encounters are similar, e.g., when a patient presents with a chronic illness or needs to renew a drug prescription. To speed up the documentation process, doctors “ditto” or copy information from previous encounters, which may include free text, prescriptions and assigned ICD codes. As such, outpatient documentation more closely resembles inpatient progress notes, which describe the daily progress of a patient’s treatment.

To circumvent leakage from training to evaluation data due to this “ditto” practice, we split the dataset into training and evaluation portions by patient. This ensures that models are evaluated on their capability to generalise to new cases, rather than memorising patients from training data. Specifically, we hold out 1000 and 1108 random patients for the development and test sets, respectively, to have a similar number of cases in both sets. To further align the evaluation with potential application scenarios, we remove “ditto” duplicates from evaluation data, since the labels for these cases are taken over from previous encounters, and there is no need for automated tools to predict them. Because “ditto” information is not explicitly marked, we approximate duplicates by ordering records by date, matching ground truth ICD-10 codes for each patient and removing subsequent records where the label sets are an exact match. This leaves us with 4323 and 4378 notes for the development and test sets, respectively (Table 1).
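This approximation of “ditto” duplicates can be sketched as follows (pandas; column names and the interpretation of “subsequent records” as the immediately preceding encounter of the same patient are our assumptions).

```python
import pandas as pd

def remove_ditto_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop an encounter whose ICD-10 label set exactly matches that of the same
    patient's immediately preceding encounter (column names are illustrative)."""
    df = df.sort_values(["patient_id", "visit_date"]).copy()
    # order-independent canonical form of each note's label set
    df["label_key"] = df["icd10_codes"].apply(lambda codes: tuple(sorted(set(codes))))
    previous = df.groupby("patient_id")["label_key"].shift()
    same_as_previous = df["label_key"] == previous
    return df[~same_as_previous].drop(columns="label_key")
```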

2.4 Data pre-processing

While removing duplicates makes sense for evaluation data, it is unclear whether duplicate entries can help models during training. As we detect duplicates based on patient and label set, the input text might change between duplicates, due to, e.g., incoming lab results or new developments of the same disease. Furthermore, while low-frequency codes have been reported as a challenge for clinical coding models [21], their practical impact on the problem is often unclear, even if many labels in the dataset are rare.

To answer these questions empirically, we optimise a baseline CAML model on three versions of the data: the full training set, a de-duplicated version, and a de-duplicated version where only labels that are observed at least 100 times in the training set are retained (Full, Dedup and Min100, respectively). Note that for the development and test sets, we do not remove any labels. We compare the performance among these versions and to the theoretical best performance, where we predict a label if it is in the ground truth and appears at least 100 times in the training set (Oracle). The results are reported in Table 2. Regarding duplicates, we see that the model greatly benefits from removing “ditto” duplicates, with an absolute performance improvement of 14 points over the model that was trained on the full set, while reducing training time by 30%. Disregarding rare labels during training has limited impact, costing below 3 points even for an oracle that predicts all other labels correctly; for actual models, there is almost no noticeable performance degradation. The training time is further reduced, which stems from a combination of faster convergence and a lower parameter count due to the reduced number of labels and their embeddings.

Table 2: R@5 on the development set of CAML-models trained in different data configurations.
Category         R@5 dev    Train time (h)
Oracle           99.26      -
Oracle+Min100    96.29      -
Full             55.66      30.2
Dedup            69.06      19.0
Dedup+Min100     68.41      9.4

Based on these insights, we continue our experiments with a “ditto”-deduplicated training set where all labels occurring fewer than 100 times are removed. The statistics of our training set are described in Table 1, in the column “OPD-dedup”. Notably, the average length of non-ditto instances is considerably shorter (396 vs. 712 characters). This suggests that the medical records of patients who visit the hospital repeatedly for treatment of the same condition “grow” by incorporating new information, such as lab results, new symptoms or similar.
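For reference, the label-frequency filter can be sketched as follows (plain Python; the data layout and the handling of notes left without labels are assumptions).

```python
from collections import Counter

def filter_rare_labels(train_examples, min_count: int = 100):
    """Keep only ICD-10 codes observed at least `min_count` times in the training set.
    Examples whose label set becomes empty are dropped here; keeping them unlabelled
    is an alternative, depending on the setup."""
    counts = Counter(code for ex in train_examples for code in ex["icd10_codes"])
    frequent = {code for code, c in counts.items() if c >= min_count}
    filtered = []
    for ex in train_examples:
        kept = [code for code in ex["icd10_codes"] if code in frequent]
        if kept:
            filtered.append({**ex, "icd10_codes": kept})
    return filtered, frequent
```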

3 Empirical Study

We broadly aim to investigate the feasibility of automating clinical coding for outpatient departments. More specifically, we seek evidence towards the following research questions:

  (i) Can solutions to automated coding of inpatient discharge summaries be applied to outpatient clinical coding? Do improvements upon the state-of-the-art in inpatient coding translate to outpatient settings?

  (ii) What is the impact of different document encoders on coding performance?

  (iii) What is the relation between training data and performance? Does more training data translate to better performance?

  (iv) On what proportion of the data do models’ predictions exactly match the set of ground-truth labels? Can these examples be identified reliably?

Table 3: AUC, F1 and Recall@5 scores for the baselines on the development and test sets of the OPD dataset (each cell reports dev/test; * marks the best score per column).
Model           AUC Macro        AUC Micro        F1 Macro         F1 Micro         F1 Instance      R@5
CAML            88.67/88.84      96.16/96.54      14.93/14.96      42.12/41.79      51.58/51.43      68.41/68.81
LAAT            94.83/94.98      98.26/98.43      17.19/17.31      58.43/57.56      59.38/58.51      74.83/75.25
MSMN            97.40*/97.98*    98.67*/99.01*    16.92/17.21      57.34/57.43      59.19/59.58      74.23/75.24
OPD-LM-LAAT     94.98/95.70      98.58/98.82      20.44/21.21      62.37/62.52      66.81/66.90      79.12/79.62
OPD-Reranker    94.58/95.14      98.33/98.54      20.83*/21.46*    62.94*/63.09*    67.05*/67.47*    79.23*/80.06*

Regarding (i), it is unclear if discharge summary coding approaches will perform well in the outpatient scenario due to the difference between in- and out-patient data. To this end, we adapt the state-of-the-art clinical coding approaches discussed in Section 2.1 [3, 16, 17, 19] that have been shown to perform well in inpatient settings [4] to the outpatient scenario.

For question (ii), there is conflicting evidence whether using transformer-based language models as document encoder improves upon traditional word embeddings [17, 19], and if their domain-specific pre-training is beneficial. We perform an ablation study to observe the difference in performance when using randomly initialised, domain-specific, or hospital-specific embeddings, for both word-vector and language-model based document encoders.

Answering question (iii), investigating model performance as a function of the training set size can provide insights on how to obtain well-performing models when constrained by available training data or hardware resources.

Finally, for question (iv), we threshold over model prediction probabilities to identify instances where predictions exactly match the ground truth label, subject to a fixed false positive rate. In practice, examples identified in such a way can be exempted from human review, further reducing the cognitive load of coding doctors.

3.1 Implementation Details

We adapt all evaluated approaches to the ICD-10 setting. For models that rely on word embeddings, we optimise these on the available training data, in line with the literature [3]. For OPD-LM-LAAT, which uses a language model, we train a custom BERT masked language model (MLM) with three layers, four heads, a hidden dimension of 512 and a feed-forward dimension of 2048 on all training notes. We decide to train from scratch to accommodate tokenisation of domain-specific terms [24].
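Such a small BERT MLM can be configured, for instance, with the HuggingFace transformers library as sketched below; the vocabulary size, tokeniser and training pipeline are assumptions, only the model dimensions follow the description above.

```python
from transformers import BertConfig, BertForMaskedLM

# Small hospital-specific BERT for masked language modelling, matching the dimensions
# reported above (3 layers, 4 heads, hidden size 512, feed-forward size 2048).
config = BertConfig(
    vocab_size=30_000,              # assumption: size of a tokeniser trained on hospital notes
    hidden_size=512,
    num_hidden_layers=3,
    num_attention_heads=4,
    intermediate_size=2048,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")

# Pre-training then proceeds with the standard MLM objective on all training notes,
# e.g. via transformers' Trainer with DataCollatorForLanguageModeling(mlm_probability=0.15).
```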

For the MSMN model, we obtain code descriptions from the 2016 release of the ICD10-CM ontology (https://www.cms.gov/Medicare/Coding/ICD10/2016-ICD-10-CM-and-GEMs) and their synonyms from the UMLS Metathesaurus [18].

For the re-ranker model, we utilise prescribed medication, carried-out procedures, doctor ID and the doctor’s department as structured information, and the textual description (i.e., the name) of medications and procedures as unstructured information. Note that because multiple prescriptions and procedures can be associated with a single outpatient visit, we take the average of their embeddings (multi-hot encoding). We use OPD-LM-LAAT as the base model to re-rank, freeze its weights when obtaining $\mathbf{P}$ and $\mathbf{H}$, and share its document encoder with the reranker.

We train all models according to the hyper-parameters reported in the respective literature, use early stopping for CAML, LAAT and MSMN, and monitor Recall@5. Further hyper-parameters are reported in the Appendix. All experiments were carried out on a NC24sv3 Azure instance with four V100 GPUs, 24 vCPUs and 448GB of RAM. CAML and LAAT were trained on a single GPU, while OPD-LM-LAAT and MSMN were trained using two GPUs each.

3.2 Results and Analysis

In this section, we report the results of our study and relate them to the guiding questions.

3.2.1 Existing clinical coding architectures can be applied in the outpatient context and additional information is helpful

Table 3 reports the performance of all four models on the development and test sets. The general performance trend observed on inpatient benchmarks carries over to our OPD dataset as well. The MSMN model constitutes an exception, as it performs worse than LAAT in terms of R@5 score but better regarding the AUC metrics. One possible reason might be that—as reported by others [4]—MSMN tends to perform better on rare codes, which are excluded from our training set by design. Incorporating additional information by training our proposed re-ranker model on top of the best performing OPD-LM-LAAT further improves performance.

Figure 2: Performance breakdown for the 20 most contributing departments, ranked by frequency of encounter (blue bar, left scale, 10x). Orange plot represents R@5 score (right scale); Purple plot represents the average number of distinct labels observed in a department per year (left scale).
Figure 3: Performance breakdown for the 150 most frequent labels, ranked by label frequency (blue bar, left scale). Orange plot represents R@5 score (right scale).

Even after removal of “ditto” instances, patients can appear more than once in the test set if they present with new conditions and therefore new codes. Comparing the performance on such “recurring” patients to patients at their first visit, the performance drops (Recall@5 of 82.63 vs 77.20 for OPD-LM-LAAT), possibly because new codes are added with relatively little accompanying documentation [12] and because the training data is biased towards “first-visit” patients due to our de-duplication method.

Further inspection of the performance broken down by department (Figure 2) reveals that the performance varies greatly by department, with no correlation between a department’s score and its number of examples in the training set. There is, however, a strong (anti-)correlation with the overall number of labels encountered per year (Spearman’s $r = -0.68$, $p < .005$). Similarly, Figure 3 shows that the performance on the 150 most frequent labels (which constitute 55% of the overall test set) correlates only weakly with their support in the training set (Spearman’s $r = 0.22$, $p < .005$).

3.2.2 Domain-specific pre-training is helpful

We investigate the impact of the document encoder choice by fixing the label encoding mechanism and comparing different document encoders. We use LAAT for word-vector based document encoders and OPD-LM-LAAT for language-model based document encoders, as they both implement the same label attention mechanism [19]. We additionally use LAAT’s word embeddings [16] and BioLM, a RoBERTa-base model optimised on MIMIC-III notes [25], as domain-specific document encoders (labelled mimic). We also report performance on randomly initialised word embeddings and LM weights. Table 4 shows that LM-based document encoders outperform word-vector based ones.

Table 4: iF1 and R@5 scores on the development and test datasets when using word vector and Language Model based encoders. Each category is optimised on hospital data, initialised randomly or optimised on MIMIC-III notes, respectively.

Note Encoder    iF1 dev    iF1 test    R@5 dev    R@5 test
OPD-w2v         59.38      58.51       74.83      75.25
random-w2v      57.92      57.53       71.36      72.54
mimic-w2v       57.26      55.28       73.01      72.97
OPD-LM          66.81      66.89       79.12      79.62
random-LM       5.00       5.12        16.83      16.45
mimic-LM        66.62      66.64       78.45      78.75

Furthermore, domain-specific pre-training improves performance compared to random embeddings and a randomly-initialised language model. The latter makes sense, as language models are typically trained for millions of steps to converge [26]. Finally, comparing the performance of domain-specific to hospital-specific document encoders, we see a clear benefit for both the hospital-specific word embeddings and the hospital-specific language model, as they outperform their MIMIC-III counterparts. This is especially remarkable because the hospital-specific model is much smaller than BioLM (24.5 vs. 124.4 million parameters, respectively).

3.2.3 Models can achieve good performance with little training data

Figure 4 shows the scores of OPD-LM-LAAT when trained on random fractions of the available training data. Concerning Recall@5, the scores saturate quickly, with a model trained on 5 percent of the data achieving 90 percent of the performance of the model trained on the full dataset. As such, more training data only yields diminishing returns in score improvement. The picture is less obvious for iF1 scores, where 10% of the data is needed to surpass the 90% performance threshold. Training a model on 50% of the data reaches 99% of the performance of the model trained on the full dataset for both Recall@5 and iF1.

Figure 4: Recall@5 and iF1 performance of OPD-LM-LAAT when optimised on random samples of 1, 5, 10, 25, 33, 50, 75 and 90% of the data, expressed as a fraction of the score when optimised on the full dataset.

This quick saturation could hint at the presence of “easy” examples which the models quickly learn to solve correctly, achieving an iF1 and Recall@5 score of 1 on these. Indeed, Figure 5 shows that the distribution of iF1 and Recall@5 scores is skewed towards the last (0.9, 1.0] bin. In fact, 33.6% and 61.4% of the examples have an instance-averaged iF1 and Recall@5 score of 1, respectively.

Figure 5: Distribution of iF1 and Recall@5 scores on the test set of OPD-LM-LAAT.

We find that, overall, shorter examples with fewer codes tend to be easier (Spearman’s $r = -0.25$ and $r = -0.14$ correlation between iF1 score and length and number of codes, respectively, both $p < 0.005$).

3.2.4 Easy examples can be identified with low false-positive rate

Given an input note and the model predictions upon it, we are looking for a rule to decide whether the model predictions are an exact match to the ground truth label set. We evaluate the efficacy of the decision rule by measuring the number of correctly identified examples at a maximum allowed false-positive (max FP) rate of 5, 10, 15 and 20%.

The decision rule can be chosen arbitrarily—for simplicity we use the confidence thresholding method: we choose an example if the predicted labels’ probabilities are all above a threshold $t_u$ and all other labels’ probabilities are below $t_l$. Using the predictions of OPD-LM-LAAT, we exhaustively search all possible $(t_u, t_l) \in [0,1]^2$ combinations in increments of 0.05 and select the one with the highest number of selected exact-match predictions on the development set, subject to max FP. The resulting number of identified examples in the test set is shown in Figure 6. Furthermore, previous research has shown that neural networks’ probabilities do not necessarily represent the models’ confidence [27]. To this end, we calibrate the models’ predictions using label-wise isotonic regression [28], where for each label, all predictions on the development set are sorted into bins for which a piece-wise constant function is fit on ground-truth labels to map from raw to calibrated probabilities.
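For illustration, the threshold search can be implemented as in the following sketch (NumPy; the base decision threshold of 0.5 and the tie-breaking are our assumptions).

```python
import numpy as np

def search_thresholds(probs, y_true, max_fp_rate=0.05, step=0.05, decision=0.5):
    """Grid-search (t_u, t_l): a note is selected when all of its predicted labels
    (probability > `decision`) score above t_u and every other label scores below t_l.
    Keep the pair selecting the most exact matches, subject to the allowed FP rate."""
    predicted = probs > decision
    exact = (predicted == y_true.astype(bool)).all(axis=1)   # model output equals gold label set
    best_count, best_pair = 0, None
    grid = np.round(np.arange(0.0, 1.0001, step), 2)
    for t_u in grid:
        for t_l in grid:
            high = np.where(predicted, probs >= t_u, True).all(axis=1)
            low = np.where(~predicted, probs <= t_l, True).all(axis=1)
            selected = high & low & predicted.any(axis=1)
            if selected.sum() == 0:
                continue
            fp_rate = (selected & ~exact).sum() / selected.sum()
            if fp_rate <= max_fp_rate and (selected & exact).sum() > best_count:
                best_count, best_pair = int((selected & exact).sum()), (t_u, t_l)
    return best_pair, best_count
```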

We find that this method indeed reduces the expected calibration error (ECE) for 73% of the labels (from $8.3\cdot 10^{-4}$ to $7.6\cdot 10^{-4}$, averaged across all dev/test labels), resulting in more faithful prediction probabilities. However, it lowers the overall iF1 and R@5 scores. Nonetheless, the calibration seemingly helps to identify high- and low-confidence predictions better, as it improves the confidence thresholding method specifically for high $t_u = 0.95$ and low $t_l = 0.10$, i.e., at the lowest false-positive error rate of 0.05, as shown in Figure 6. Overall, the best method can identify 50% of the exact-match instances of the best models (17% of all test set notes) at the lowest max FP rate.
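A sketch of the label-wise calibration step using scikit-learn's isotonic regression (which fits the piece-wise constant mapping described above); array shapes are (instances, labels), and variable names are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_per_label(dev_probs, dev_true, test_probs):
    """Fit one isotonic regressor per label on development-set probabilities
    and apply it to test-set probabilities."""
    calibrated = np.empty_like(test_probs)
    for j in range(dev_probs.shape[1]):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(dev_probs[:, j], dev_true[:, j])
        calibrated[:, j] = iso.predict(test_probs[:, j])
    return calibrated
```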

Figure 6: Percentage of correctly identified instances with iF1 scores of 1 (out of all possible) by using confidence thresholding with and without calibration, subject to maximum false-positive rate, i.e., the number of examples identified incorrectly as having iF1 of 1 divided by all examples identified.

4 Discussion and Related Work

We relate our findings to the following three directions emerging in the literature on clinical coding: incorporating additional knowledge by improving label and document representations, improving performance on rare labels, and replicating performance on data other than MIMIC-III.

4.0.1 Improving Representations

Different neural architectures have been proposed to represent input documents, including CNNs [3], LSTMs [16] and pretrained language models [29, 19] or their combinations [30]. For inpatient settings, conflicting claims regarding the efficacy of pretrained language models have been put forward [17, 19], with the best-performing model on MIMIC-III employing a combination of both, the LSTM-based MSMN architecture to obtain an initial ranked list, followed by a transformer-based language model used for reranking [21]. Regarding label representations, previous works have proposed the label-attention mechanism [31, 3], which was further refined to incorporate label hierarchy and co-occurrences [32, 16, 33], label descriptions [34] and synonyms [17] and other external knowledge [35].

We find that for outpatient coding, transformer-based document encoders clearly outperform word-embedding-based ones. Furthermore, leveraging hospital-specific information by using a language model that was trained on hospital data further improves performance. Additionally, transformer-based encoders might perform so well in our setting because the input documents are shorter than the 512-token limit of most transformers, such that no advanced input chunking strategies are required [19].

4.0.2 Rare labels

Some previous works made efforts to improve the coding performance on rare [21] or unseen labels [36, 37]. Conversely, we find that from a practical perspective, rare and unseen codes hardly play a role in the overall performance of the models when focusing on instance-averaged metrics, both in theory and in practice.

4.0.3 Replicating performance

Finally, some works have extended the evaluation of their methods beyond the MIMIC-III dataset. Most recently, MIMIC-IV benchmarks were introduced [4, 38], including portions of records associated with ICD-10 codes, the up-to-date ontology used in practice. Others have applied their methods to other datasets, such as data from other US hospitals [39, 40, 41] or countries [42, 43, 44, 45]. To the best of our knowledge, all these studies focus on inpatient or emergency departments, as opposed to outpatient visits discussed in our paper.

4.0.4 Label Quality

Potentially erroneous human-assigned labels pose a well-documented problem for evaluation of automated clinical coding approaches [46, 47, 48]. We also note that the performance of evaluated models plateaus around 80% Recall@5, i.e., retrieving 4 out of 5 codes on average. To estimate the quality of annotations, we measure the (in)consistency of the labels by cross-referencing the ICD10 labels assigned to the same patient between inpatient and subsequent outpatient visits occurring within at most seven days. We find instances of code inconsistency, such as “E78.2: Mixed hyperlipidemia” (inpatient) and “E78.5: Hyperlipidemia, unspecified” (outpatient) for the same patient. A similar situation can be observed with “I70.203: Unspecified atherosclerosis of native arteries of extremities, bilateral legs” and “I70.202: Unspecified atherosclerosis of native arteries of extremities, left leg” for the two respective visits by a patient. Of 493 matches between the labels in our data, we find an inconsistency rate of 20.08% at level 3 or higher (i.e., two matched codes have the same chapter—the first three characters—but differ in any character afterwards) in the ICD10 codes. We deem these co-occurrences in quick succession unlikely (e.g., it is highly unlikely that the condition of mixed hyperlipidemia changes to another, unspecified hyperlipidemia within less than a week) and interpret them as coding errors. Extrapolating this finding to the full dataset, we assume that 80% of our labels are correct, close to the best-performing re-ranker model’s Recall@5 score, suggesting that this model already reaches the possible performance ceiling. Note that this assumption is based on suggestive rather than conclusive evidence, as the selection criterion of patients that were transferred from inpatient to outpatient settings within a week might introduce an unknown bias.
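The level-3 consistency check can be expressed as in the following sketch, which is our formalisation of the rule stated above.

```python
def is_inconsistent(inpatient_code: str, outpatient_code: str) -> bool:
    """Level-3 inconsistency: same first three characters (e.g. 'E78')
    but a different code afterwards ('E78.2' vs 'E78.5')."""
    a = inpatient_code.replace(".", "")
    b = outpatient_code.replace(".", "")
    return a[:3] == b[:3] and a != b

# Example: the pair discussed in the text counts as inconsistent.
assert is_inconsistent("E78.2", "E78.5")
assert not is_inconsistent("E78.2", "E78.2")
```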

4.0.5 Study Limitations

Similar considerations as with the MIMIC-III and -IV datasets apply: our study was carried out on a dataset from one hospital in a specific country. As such, even though it encompasses over half a million patients, our dataset might lack diversity. To alleviate this and make our conclusions more robust, our findings will be replicated on other outpatient clinical coding datasets as future work. The only comparable study of outpatient clinical coding [49] does not encompass the whole hospital and focuses only on five outpatient departments. More importantly, they only predict “parent” ICD-10 codes up to the first three characters (i.e., “E11” instead of the billable “E11.9” code). As such, their developed models are not directly able to predict billable codes; therefore, the study is of a preliminary nature. However, regarding the by-department breakdown, they report trends similar to our results, both regarding (relative) per-department performance as well as the diversity of codes encountered per department.

Furthermore, we empirically choose the best hyper-parameters for each of the evaluated models. To obtain more robust final performance numbers, this process can be complemented by an exhaustive hyper-parameter search. Hyper-parameters can greatly impact the final performance [50], with the CAML model outperforming multiple subsequent incremental improvements after a careful selection of hyper-parameters. However, it has been shown to not outperform the baseline model architectures selected for our study [4], which gives us confidence that the trends in performance we report are robust to hyper-parameter choice.

5 Conclusion

In this paper we have investigated the feasibility of automated clinical coding approaches when applied to assisting doctors in outpatient settings. Our results indicate that, generally, advancements in the state-of-the-art on publicly available benchmarks for clinical coding [14, 15, 4] can be transferred to the outpatient setting.

Based on our analysis, we formulate the following recommendations for researchers or practitioners who aim to replicate our results on other outpatient datasets. Train your own language model, as hospital-specialised, transformer-based document encoders have outperformed other document encoder approaches in our experiments. Remove noise, such as duplicate entries or low-frequency codes, as this greatly improves training stability and speed. Utilise additional information—our proposed re-ranking framework incorporates both structured and free-text additional information, which, as our experiments show, further improves performance over text-only approaches. Start training early, even if lacking annotated data, as we could achieve good performance with only a fraction of the available data. Find easy examples, as labels for these can be pre-selected automatically, further increasing the operational efficiency of outpatient doctors.

To address the suspected label inconsistencies, we aim to further expand on our methods to detect them, for example by using model confidence scores or model interpretability methods to identify conflicting label assignments. This can be used to notify practitioners during the coding process, ultimately improving coding consistency and reducing errors.

References


  • [1] WHO, “ICD-10 international classification of diseases,” Geneva: World Health Organization, 1993.
  • [2] M. H. Stanfill, M. Williams, S. H. Fenton, R. A. Jenders, and W. R. Hersh, “A systematic literature review of automated clinical coding and classification systems,” Journal of the American Medical Informatics Association, vol. 17, pp. 646–651, 11 2010.
  • [3] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein, “Explainable Prediction of Medical Codes from Clinical Text,” NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, pp. 1101–1111, 2018.
  • [4] T.-T. Nguyen, V. Schlegel, A. Kashyap, S. Winkler, S.-S. Huang, J.-J. Liu, and C.-J. Lin, “MIMIC-IV-ICD: A new benchmark for eXtreme MultiLabel Classification,” arXiv preprint arXiv:2304.13998, 4 2023.
  • [5] R. Kaur, J. A. Ginige, and O. Obst, “AI-based ICD coding and classification approaches using discharge summaries: A systematic literature review,” Expert Systems with Applications, vol. 213, p. 118997, 3 2023.
  • [6] T. T. Nguyen, V. Schlegel, A. Kashyap, and S. Winkler, “A Two-Stage Decoder for Efficient ICD Coding,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 4658–4665, 2023.
  • [7] J. Wimsett, A. Harper, and P. Jones, “Review article: Components of a good quality discharge summary: A systematic review,” EMA - Emergency Medicine Australasia, vol. 26, pp. 430–438, 10 2014.
  • [8] A. Rule, S. Bedrick, M. F. Chiang, and M. R. Hribar, “Length and Redundancy of Outpatient Progress Notes Across a Decade at an Academic Medical Center,” JAMA Network Open, vol. 4, pp. e2115334–e2115334, 7 2021.
  • [9] L. Roberts, S. Araromi, and O. Peatman, “Clinical coding - an insight into healthcare data,” The British Student Doctor, vol. 2, p. 36, 6 2018.
  • [10] S. A. R. Nouraei, J. S. Virk, A. Hudovsky, C. Wathen, A. Darzi, and D. Parsons, “Accuracy of clinician-clinical coder information handover following acute medical admissions: implication for using administrative datasets in clinical outcomes management,” Journal of Public Health, vol. 38, pp. 352–362, 6 2016.
  • [11] F. W. Liang, L. Y. Wang, L. Y. Liu, C. Y. Li, and T. H. Lu, “Physician code creep after the initiation of outpatient volume control program and implications for appropriate ICD-10-CM coding,” BMC Health Services Research, vol. 20, pp. 1–7, 2 2020.
  • [12] L. M. Schilling, L. A. Crane, A. Kempe, D. S. Main, M. R. Sills, and A. J. Davidson, “Perceived frequency and impact of missing information at pediatric emergency and general ambulatory encounters,” Applied Clinical Informatics, vol. 1, no. 3, pp. 318–330, 2010.
  • [13] S. E. Pollard, P. M. Neri, A. R. Wilcox, L. A. Volk, D. H. Williams, G. D. Schiff, H. Z. Ramelson, and D. W. Bates, “How physicians document outpatient visit notes in an electronic health record,” International Journal of Medical Informatics, vol. 82, pp. 39–46, 1 2013.
  • [14] A. E. Johnson, T. J. Pollard, L. Shen, L. W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark, “MIMIC-III, a freely accessible critical care database,” Scientific Data 2016 3:1, vol. 3, pp. 1–9, 5 2016.
  • [15] A. E. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. J. Pollard, B. Moody, B. Gow, L. w. H. Lehman, L. A. Celi, and R. G. Mark, “MIMIC-IV, a freely accessible electronic health record dataset,” Scientific Data, vol. 10, 12 2023.
  • [16] T. Vu, D. Q. Nguyen, and A. Nguyen, “A Label Attention Model for ICD Coding from Clinical Text,” IJCAI International Joint Conference on Artificial Intelligence, vol. 4, pp. 3335–3341, 7 2020.
  • [17] Z. Yuan, C. Tan, and S. Huang, “Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 2, pp. 808–814, 2022.
  • [18] O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminology,” Nucleic Acids Research, vol. 32, pp. D267–D270, 1 2004.
  • [19] C. W. Huang, S. C. Tsai, and Y. N. Chen, “PLM-ICD: Automatic ICD Coding with Pretrained Language Models,” ClinicalNLP 2022 - 4th Workshop on Clinical Natural Language Processing, Proceedings, pp. 10–20, 2022.
  • [20] M. Stanfill, “Coding Professionals’ Feelings toward Computers and Automated Coding,” Perspectives in Health Information Management, CAC Proceedings, 2008.
  • [21] Z. Yang, S. Wang, B. Pratap, S. Rawat, A. Mitra, and H. Yu, “Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding,” in Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 1767–1781, 2022.
  • [22] S. Campbell and K. Giadresco, “Computer-assisted clinical coding: A narrative review of the literature on its benefits, limitations, implementation and impact on clinical coding professionals,” Health Information Management Journal, vol. 49, pp. 5–18, 1 2020.
  • [23] R. W. Engle, “What is working memory capacity?,” The nature of remembering: Essays in honor of Robert G. Crowder., pp. 297–314, 10 2004.
  • [24] E. Lehman, E. Hernandez, D. Mahajan, J. Wulff, M. J. Smith, Z. Ziegler, D. Nadler, P. Szolovits, A. Johnson, and E. Alsentzer, “Do We Still Need Clinical Language Models?,” arXiv preprint arXiv:2302.08091, 2 2023.
  • [25] P. Lewis, M. Ott, J. Du, and V. Stoyanov, “Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art,” in Proceedings of the 3rd Clinical Natural Language Processing Workshop, pp. 146–157, Association for Computational Linguistics (ACL), 11 2020.
  • [26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Stroudsburg, PA, USA), pp. 4171–4186, Association for Computational Linguistics, 2019.
  • [27] J. Vaicenavicius, D. Widmann, C. Andersson, F. Lindsten, J. Roll, and T. B. Schön, “Evaluating model calibration in classification,” in Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pp. 3459–3467, PMLR, 4 2019.
  • [28] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 694–699, 2002.
  • [29] D. Pascual, S. Luck, and R. Wattenhofer, “Towards BERT-based Automatic ICD Coding: Limitations and Opportunities,” Proceedings of the 20th Workshop on Biomedical Language Processing, BioNLP 2021, pp. 54–63, 2021.
  • [30] T. Zhou, P. Cao, Y. Chen, K. Liu, J. Zhao, K. Niu, W. Chong, and S. Liu, “Automatic ICD Coding via Interactive Shared Representation Networks with Self-distillation Mechanism,” ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 5948–5957, 2021.
  • [31] P. Xie, H. Shi, M. Zhang, and E. P. Xing, “A Neural Architecture for Automated ICD Coding,” ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), vol. 1, pp. 1066–1076, 2018.
  • [32] M. Falis, M. Pajak, A. Lisowska, P. Schrempf, L. Deckers, S. Mikhael, S. A. Tsaftaris, and A. Q. O’Neil, “Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text,” LOUHI@EMNLP 2019 - 10th International Workshop on Health Text Mining and Information Analysis, Proceedings, pp. 168–177, 2019.
  • [33] P. Cao, Y. Chen, K. Liu, J. Zhao, S. Liu, and W. Chong, “HyperCore: Hyperbolic and Co-graph Representation for Automatic ICD Coding,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 3105–3114, 2020.
  • [34] M. Feucht, Z. Wu, S. Althammer, and V. Tresp, “Description-based Label Attention Classifier for Explainable ICD-9 Classification,” W-NUT 2021 - 7th Workshop on Noisy User-Generated Text, Proceedings of the Conference, pp. 62–66, 2021.
  • [35] T. Wang, L. Zhang, C. Ye, J. Liu, and D. Zhou, “A Novel Framework Based on Medical Concept Driven Attention for Explainable Medical Code Prediction via External Knowledge,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1407–1416, 2022.
  • [36] A. Rios and R. Kavuluru, “Few-Shot and Zero-Shot Multi-Label Learning for Structured Label Spaces,” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 3132–3142, 2018.
  • [37] J. Lu, L. Du, M. Liu, and J. Dipnall, “Multi-label Few/Zero-shot Learning with Knowledge Aggregated from Multiple Label Graphs,” EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 2935–2943, 2020.
  • [38] J. Edin, A. Junge, J. D. Havtorn, L. Borgholt, M. Maistro, T. Ruotsalo, and L. Maaløe, “Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study,” in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, (New York, NY, USA), pp. 2572–2582, ACM, 7 2023.
  • [39] Z. Zhang, J. Liu, and N. Razavian, “BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining,” in Proceedings of the 3rd Clinical Natural Language Processing Workshop, pp. 24–34, Association for Computational Linguistics (ACL), 11 2020.
  • [40] R. Kavuluru, A. Rios, and Y. Lu, “An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records,” Artificial Intelligence in Medicine, vol. 65, pp. 155–166, 10 2015.
  • [41] A. Rios and R. Kavuluru, “Neural transfer learning for assigning diagnosis codes to EMRs,” Artificial Intelligence in Medicine, vol. 96, pp. 116–122, 5 2019.
  • [42] C. Lin, C. J. Hsu, Y. S. Lou, S. J. Yeh, C. C. Lee, S. L. Su, and H. C. Chen, “Artificial Intelligence Learning Semantics via External Resources for Classifying Diagnosis Codes in Discharge Notes,” Journal of Medical Internet Research, vol. 19, no. 11, p. e380, 11 2017.
  • [43] E. Moons, A. Khanna, A. Akkasi, and M. F. Moens, “A Comparison of Deep Learning Methods for ICD Coding of Clinical Records,” Applied Sciences, vol. 10, p. 5262, 7 2020.
  • [44] V. Mayya, S. Sowmya Kamath, G. S. Krishnan, and T. Gangavarapu, “Multi-channel, convolutional attention based neural model for automated diagnostic coding of unstructured patient discharge summaries,” Future Generation Computer Systems, vol. 118, pp. 374–391, 5 2021.
  • [45] H. Dong, V. Suárez-Paniagua, W. Whiteley, and H. Wu, “Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation,” Journal of Biomedical Informatics, vol. 116, p. 103728, 4 2021.
  • [46] J. Horsky, E. A. Drucker, and H. Z. Ramelson, “Accuracy and Completeness of Clinical Coding Using ICD-10 for Ambulatory Visits,” AMIA Annual Symposium Proceedings, vol. 2017, p. 912, 2017.
  • [47] C. Yeoh and H. Davies, “Clinical coding: completeness and accuracy when doctors take it on.,” BMJ : British Medical Journal, vol. 306, p. 972, 4 1993.
  • [48] N. A. Heywood, M. D. Gill, N. Charlwood, R. Brindle, C. C. Kirwan, N. Allen, P. Charleston, P. Coe, J. Cunningham, S. Duff, L. Forrest, C. Hall, S. Hassan, B. Hornung, M. al Jarabah, A. Jones, J. Mbuvi, T. Mclaughlin, J. Nicholson, J. Overton, A. Rees, H. Sekhar, J. Smith, S. Smith, N. Sung, N. Tarr, R. Teasdale, and J. Wilkinson, “Improving accuracy of clinical coding in surgery: collaboration is key,” Journal of Surgical Research, vol. 204, pp. 490–495, 8 2016.
  • [49] J. Hossain, B. Masud, C.-C. Kuo, C.-Y. Yeh, H.-C. Yang, and M.-C. Lin, “Applying Deep Learning Model to Predict Diagnosis Code of Medical Records,” Diagnostics, vol. 13, p. 2297, 7 2023.
  • [50] J. J. Liu, T. H. Yang, S. A. Chen, and C. J. Lin, “Parameter Selection: Why We Should Pay More Attention to It,” ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, vol. 2, pp. 825–830, 2021.

Appendix A Further experiment details

Regarding the choice of hyper-parameters, we leave all settings of all implementations as recommended by the authors of the corresponding papers, with the following exceptions: For CAML and LAAT, we train our own word embeddings on the training portion of our dataset using the code provided by CAML; For MSMN, we also obtain a word frequency list from our data and set the batch size to 8; For the OPD-LM-LAAT model, we implement a custom LAAT-based multilabel classifier in the PLM-ICD implementation [19] to fit our LM architecture. We change the number of training epochs from 20 to 8, due to dataset size, and train with a batch size of 32; For the re-ranker model, we use two heads for both multi-head attention mechanisms. The dimensionality of all embeddings and hidden representations is 512. We train the re-ranker model for 5 epochs with a batch size of 32. For all models, we set the maximum input length to 512 tokens according to their respective tokenisation methods. A comparison with the MIMIC-IV scores is reported below. Across MIMIC-IV and OPD, PLM-ICD should be compared to our OPD implementation OPD-LM-LAAT.

                MIMIC-IV                  OPD
Model           F1 Macro    F1 Micro      F1 Macro    F1 Micro
CAML            4.61        53.32         14.23       40.79
LAAT            4.47        55.40         17.31       57.56
MSMN            5.42        55.91         17.21       57.43
PLM-ICD         4.90        56.95         -           -
OPD-LM-LAAT     -           -             21.21       62.52