
PharmMT: A Neural Machine Translation Approach to Simplify Prescription Directions

Jiazhao Li,1 Corey Lester,2 Xinyan Zhao,1 Yuting Ding,2
Yun Jiang,3 V.G.Vinod Vydiswaran4,1
1School of Information; 2Department of Clinical Pharmacy, College of Pharmacy;
3Department of Systems, Populations, and Leadership, School of Nursing;
4Department of Learning Health Sciences, Medical School
University of Michigan, Ann Arbor, MI
{jiazhaol,lesterca,zhaoxy,dingyt,jiangyu,vgvinodv}@umich.edu
Abstract

The language used by physicians and health professionals in prescription directions includes medical jargon and implicit directives that cause considerable confusion among patients. Human intervention to simplify the language at pharmacies may introduce additional errors that can lead to potentially severe health outcomes. We propose a novel machine translation-based approach, PharmMT, to automatically and reliably simplify prescription directions into patient-friendly language, thereby significantly reducing pharmacist workload. The proposed approach is evaluated over a dataset consisting of over 530K prescriptions obtained from a large mail-order pharmacy. The end-to-end system achieves a BLEU score of 60.27 against the reference directions generated by pharmacists, a 39.6% relative improvement over the rule-based normalization baseline. Pharmacists judged 94.3% of the simplified directions as usable as-is or with minimal changes. This work demonstrates the feasibility of a machine translation-based tool for simplifying prescription directions in real-life pharmacy settings.

1 Introduction

Adverse drug events stemming from medication errors are a major concern in patient care and are estimated to cost US$42 billion annually, or roughly 1% of total global health expenditure. In the US alone, medication errors cause one death every day and are responsible for over 700,000 emergency department visits and over 100,000 hospitalizations each year (Budnitz06; Budnitz11; WHO17a).

One of the common sources of medication errors in the US is the directions on the 1.91 billion electronic prescriptions (e-prescriptions) transmitted annually (moniz2011addition; odukoya2014prescribing; odukoya2015hidden). The style and language used in e-prescriptions are highly variable and often filled with medical jargon. For example, a recent study (Yang18) noted that the direction “Take 1 tablet by mouth once daily” was represented in 832 different ways, and that 10.1% of e-prescriptions contained incorrect or confusing language. Pharmacists play a vital role as intermediaries between physicians and patients by translating the rich medical jargon in physician-written e-prescriptions into patient-comprehensible directions on the labels printed on pill bottles. However, manual translation is time-consuming and error-prone, and ambiguous prescriptions can lead to medication errors and other patient safety risks.

In this paper, we propose a machine translation-based system, called PharmMT, to simplify the e-prescription directions authored by physicians into patient-friendly language. The goal of this system is to automate the translation and normalization of e-prescription directions and reduce the pharmacists’ overall workload. We investigate multiple neural network-based models, including transformer-based and bi-directional LSTM models, rule-based approaches, and a hybrid model that combines neural models with a rule-based back-off. The proposed system is trained and evaluated over a dataset of over 530K paired e-prescriptions and their human-simplified text, obtained from a large mail-order pharmacy. Using automated machine translation metrics, we compare the performance of PharmMT against a rule-based approach developed with domain knowledge from pharmacists, and show that PharmMT performs significantly better than the rule-based baseline. Manual evaluation by pharmacists shows high potential for directly applying the proposed approach in pharmacies to simplify e-prescription directions.

The contributions of this work are:
1. We develop a neural machine translation model for simplifying e-prescriptions and build an end-to-end system to generate normalized, patient-friendly, and usable translations. The model achieves a BLEU score of 60.27 against the reference directions generated by pharmacists; and 94.3% of simplified prescriptions are judged as usable as-is or with minimal changes by pharmacists.
2. To the best of our knowledge, our work is the first systematic effort to study neural network models to simplify e-prescription directions. We also developed a rule-based approach as the baseline of this task. The code base for both the rule-based system and the machine translation model will be released for research use.
3. Our work adds additional insights into the limitations of purely automated evaluation metrics of machine translation for domain-adaptive tasks, and offers alternative modes of evaluation.

E-prescription direction → Simplified direction
2 puffs orally q 4 hrs x90 dys wheeze → Inhale 2 puffs by mouth every 4 hours for 90 days for wheeze
1 g vaginal mon/tu/th/fr → Insert 1 gram vaginally monday, tuesday, thursday and friday
as needed prn; 1 po qd prn → Take 1 tablet by mouth once a day as needed
oral one tab po qd prn → Take 1 tablet by mouth every day as needed
                       → Take 1 tablet by mouth daily as needed

Drug name and strength → Normalized drug name and strength
albuterol 90 mcg/inh inhalation aerosol → PROAIR HFA AER
0.1 mg/g vaginal cream → ESTRADIOL CRE 0.01%
traMADol 50 mg tablet → TRAMADOL HCL TAB 50MG
Table 1: Top: Example pairs of e-prescription directions and corresponding simplified text. Variations of these directions exist on both sides. Bottom: Drug name and strength information in the original and normalized forms.

2 Related work

Prior work on automated approaches for translating e-prescription directions mainly focused on information extraction models that relied on handwritten rules or linguistic signals found in prescription free text. Tools such as MetaMap (aronson2010overview) and MedLEE (friedman1996web) extract and organize clinical information in text documents using external knowledge sources, such as the Unified Medical Language System (UMLS). Other systems, such as FABLE (tao2018fable), employed a conditional random field-based model to recognize medication entities. Other researchers have proposed rule-based approaches to normalize and simplify directions using task-specific knowledge, such as common abbreviations used in prescriptions (qenam2017text; kandula2010semantic).

While machine translation-based approaches have not yet been proposed for translating e-prescription directions, prior work (yolchuyeva2018text; shardlow-nawaz-2019-neural; van2019evaluating) has suggested solving similar tasks as machine translation without the need for explicitly-defined rules. Neural machine translation (NMT) models have been shown to learn contextual rules automatically from large corpora and produce higher-quality translations (bahdanau2014neural; wu2016google; lee2017fully). Other researchers (aw2006phrase; xu2016optimizing) have shown that while statistical machine translation methods mainly relied on lexical rules to minimize sentence complexity, NMT models can capture richer syntactic information (shi-etal-2016-string).

Researchers studying deep neural network models have explored multiple encoder-decoder frameworks, such as Transformer-based networks (NIPS2017_7181) and Recurrent Neural Networks (RNN), including Long Short-Term Memory (LSTM) models (hochreiter1997long) and Gated Recurrent Units (GRU) (chung2014empirical). RNN units have been used to encode source sentences into fixed-length representations that are then decoded into reference sentences (DBLP:journals/corr/ChoMGBSB14). Models with deep LSTM-based RNN units have shown the benefits of a deeper structure (DBLP:journals/corr/WuSCLNMKCGMKSJL16). In other work, DBLP:journals/corr/GehringAGYD17 introduced a convolutional neural network with an attention mechanism to learn long-range dependencies. More recently, NIPS2017_7181 developed a novel Transformer architecture that relies on attention without recurrence or convolution, resulting in state-of-the-art performance on many related natural language processing tasks, such as recognizing textual entailment, sentiment analysis, and natural language inference (raffel2019exploring; lan2019albert; devlin2018bert).

3 Simplifying e-prescription directions

We frame the challenge of simplifying e-prescription directions as a machine translation task from physician-authored directions (“source”) to patient-facing text authored by pharmacists (“reference”). This monolingual translation task focuses on replacing highly-abbreviated medical jargon with patient-friendly vocabulary, simplifying cryptic expressions, and normalizing them so that they can be used with minimal changes by the pharmacists. Table 1 (top panel) shows three examples of e-prescriptions and their corresponding simplified directions.

E-prescription directions consist of specific components related to the prescribed drug, viz., dosage, form, route, duration, frequency, and reason for prescribed use. For example, the e-prescription “2 puffs orally q 4 hrs x90 dys wheeze” specifies that the patient should inhale 2 (dosage) puffs (form) by mouth (route) every 4 hours (frequency) for 90 days (duration) for wheezing (reason). While not all components are present in every e-prescription direction, some components are critical and need to be stated explicitly. The name and strength of the prescribed drug are also available as auxiliary information. Examples of drug names and strengths are shown in the bottom panel of Table 1.
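To make the component structure concrete, the sketch below models the fields of a parsed direction as a small Python dataclass; the class and field names are our own illustration, not an artifact of the PharmMT system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DirectionComponents:
    """Key components of an e-prescription direction (illustrative names)."""
    dosage: Optional[str] = None     # e.g., "2"
    form: Optional[str] = None       # e.g., "puffs"
    route: Optional[str] = None      # e.g., "by mouth"
    frequency: Optional[str] = None  # e.g., "every 4 hours"
    duration: Optional[str] = None   # e.g., "for 90 days"
    reason: Optional[str] = None     # e.g., "wheezing"

# The example direction "2 puffs orally q 4 hrs x90 dys wheeze" parses to:
example = DirectionComponents(dosage="2", form="puffs", route="by mouth",
                              frequency="every 4 hours",
                              duration="for 90 days", reason="wheezing")
```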

While we formulate the challenge as a machine translation task in the pharmacy domain, one of the key desiderata of the automated approach is to preserve the accuracy and consistency of the key components in a prescription. To achieve this, we develop an end-to-end system called PharmMT, consisting of three stages: neural machine translation; numerical check and graceful back-off; and normalization, as depicted in Figure 1.

Figure 1: Schematic diagram of the PharmMT system

3.1 Neural Machine Translation (NMT)

The primary component of the proposed approach is a sequential model that “translates” physician-authored e-prescription text to normalized, patient-friendly language using an NMT framework. NMT models map a source sequence $\mathbf{x}: x_1, x_2, \ldots, x_n$ into a reference sequence $\mathbf{y}: y_1, y_2, \ldots, y_m$ by maximizing the conditional probability $p(\mathbf{y}|\mathbf{x})$ using an Encoder-Decoder framework (DBLP:journals/corr/Neubig17). One such model is a recurrent sequence-to-sequence model, which consists of a bidirectional LSTM model (schuster1997bidirectional) with global attention as the encoder and a forward-sequence LSTM model (hochreiter1997long) as the decoder. Both encoder and decoder stages are configured as multi-layer models, with a hidden state in each layer, to sufficiently capture the deep semantic components in the e-prescription (barone2017deep).
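As a minimal sketch of this encoder-decoder pairing, the PyTorch code below wires a bidirectional LSTM encoder to an LSTM decoder with global (dot-product) attention. The dimensions and layer counts are placeholders rather than the trained configuration, and training (maximizing $p(\mathbf{y}|\mathbf{x})$ via cross-entropy) is omitted.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256, layers=2, dropout=0.4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional encoder; hid_dim is split across the two directions.
        self.rnn = nn.LSTM(emb_dim, hid_dim // 2, num_layers=layers,
                           bidirectional=True, dropout=dropout, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        enc_out, _ = self.rnn(self.embed(src))   # (batch, src_len, hid_dim)
        return enc_out

class AttnLSTMDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256, layers=2, dropout=0.4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers=layers,
                           dropout=dropout, batch_first=True)
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, tgt, enc_out):             # tgt: (batch, tgt_len)
        dec_out, _ = self.rnn(self.embed(tgt))   # (batch, tgt_len, hid_dim)
        # Global attention: score each decoder state against all encoder states.
        attn = torch.softmax(torch.bmm(dec_out, enc_out.transpose(1, 2)), dim=-1)
        context = torch.bmm(attn, enc_out)       # (batch, tgt_len, hid_dim)
        return self.out(torch.cat([dec_out, context], dim=-1))  # token logits
```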

To compare against the recurrent sequence-to-sequence model, we also trained an attention-based Transformer model (NIPS2017_7181). Position embedding was enabled to capture sequence information and provide architectural complexity similar to that of a recurrent network. Both models were developed using the OpenNMT framework (klein-etal-2017-opennmt; klein-etal-2018-opennmt), and the dropout probability was adjusted to prevent over-fitting (srivastava2014dropout). Additional experimental details can be found in Section 4.2.

3.1.1 Augmenting auxiliary information

As described in Section 3, e-prescriptions contain auxiliary information on the drug name and strength. While the primary task is to simplify just the direction, access to the auxiliary information may help distinguish directions based on the context associated with drugs. This also matches the real-life information available to pharmacists. We hypothesize that the auxiliary information will help the neural machine translation models simplify directions. To test this hypothesis, we prepend the drug name and strength information to the “source” direction before training the models. The updated model is evaluated on the original task of simplifying just the directions.
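A sketch of this augmentation step is below; the separator token and function name are illustrative choices, since the exact concatenation format is not specified here.

```python
def prepend_auxiliary(drug_name: str, strength: str, direction: str) -> str:
    """Prepend auxiliary drug context to the source direction before training.

    The "<sep>" delimiter is a hypothetical choice for illustration.
    """
    return f"{drug_name} {strength} <sep> {direction}"

# -> "albuterol 90 mcg/inh <sep> 2 puffs orally q 4 hrs x90 dys wheeze"
print(prepend_auxiliary("albuterol", "90 mcg/inh",
                        "2 puffs orally q 4 hrs x90 dys wheeze"))
```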

3.1.2 Pre-trained word embeddings

The input representation uses a pre-trained word embedding layer so that words appearing in similar contexts have similar representations. Word embeddings capture fine-grained semantic and syntactic word relationships and, in turn, provide a better initialization for gradient optimization. We explored static pre-trained embedding models and compared them against a randomly-initialized representation. The first was the general-domain GloVe word embeddings, pre-trained on the Wikipedia and Gigaword corpora (pennington2014glove). The second was clinical domain-adaptive word embeddings that we trained on MIMIC-III, a large corpus of clinical notes (johnson2016mimic), and a dataset of pharmacy directions (Pharmacy_2020). Our hypothesis is that domain-adaptive word embeddings would outperform both the general-domain embeddings and randomly-initialized vector embeddings.
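The toolkit used to produce the clinical embeddings is not specified here; as one plausible recipe, the gensim sketch below trains static word2vec vectors over tokenized clinical sentences. The two-line corpus is a stand-in for MIMIC-III notes plus the pharmacy-direction corpus.

```python
from gensim.models import Word2Vec

# Stand-in corpus; in practice this would be tokenized MIMIC-III notes
# plus the pharmacy-direction corpus.
sentences = [["take", "1", "tablet", "by", "mouth", "daily"],
             ["inhale", "2", "puffs", "by", "mouth", "every", "4", "hours"]]

model = Word2Vec(sentences=sentences, vector_size=256, window=5,
                 min_count=1, workers=4, epochs=5)
model.wv.save_word2vec_format("clinical_embeddings.txt")  # reusable downstream
```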

3.1.3 Learning ensemble models

The basic motivation of ensemble learning over neural network models is to improve the robustness of the final model against the variation and randomness that dropout and random seeding introduce into parameterized modules. During inference, the final distribution over the output vocabulary is computed by averaging the output distributions of the individually trained models in the ensemble.
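The sketch below shows this averaging; the `step` interface is hypothetical (OpenNMT-py, for instance, performs ensemble decoding internally when several checkpoints are passed to its translate command).

```python
import torch

def ensemble_next_token_dist(models, src, tgt_prefix):
    """Average the next-token distributions of independently trained models.

    Each model is assumed to expose step(src, tgt_prefix) -> vocabulary
    logits for the next token; this interface is illustrative only.
    """
    with torch.no_grad():
        dists = [torch.softmax(m.step(src, tgt_prefix), dim=-1) for m in models]
    return torch.stack(dists).mean(dim=0)  # averaged output distribution
```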

3.2 Numerical check

Once the machine translation module generates the simplified candidate directions, the candidates are checked for consistency of key components of the prescription. Numerical components – including dosage, frequency, and duration – are critical in prescriptions. Medication under-dosage often leads to poorer health outcomes, while over-dosage can have severe, even fatal, consequences.

The correctness of the numerical components in the simplified directions is checked by comparing against the source e-prescription. We incorporated two different numerical checking strategies – token-based and NER-based. In token-based checking, all numeric tokens that appear in the simplified direction are checked against the numeric tokens in the source direction. This bag-of-tokens approach helps flag any simplified direction that “makes up” numeric values in a key component.

On the other hand, the token-based checker is prone to false consistency claims; for example, when the simplified direction swaps a dosage value with the frequency. To overcome this, we incorporated a pre-trained medication NER model (XinyanNER) to tag dosage, frequency, and duration components in both the source and simplified directions, and compared them component-by-component. The NER model was trained on a medication extraction task (henry20202018) and achieved an overall F1 score of 0.9571 across all medication components.
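A minimal sketch of the two strategies follows, assuming a callable `tagger` that stands in for the pre-trained medication NER model and returns a dict of component strings; handling of number words such as “one” is omitted.

```python
import re

NUMERIC = re.compile(r"\d+(?:\.\d+)?")

def token_check(source: str, candidate: str) -> bool:
    """Token-based: every numeric token in the simplified direction must
    already appear in the source (order-insensitive bag of tokens)."""
    return set(NUMERIC.findall(candidate)) <= set(NUMERIC.findall(source))

def ner_check(source: str, candidate: str, tagger) -> bool:
    """NER-based: compare dosage/frequency/duration component-by-component.

    `tagger` stands in for the pre-trained medication NER model and is
    assumed to return, e.g., {"dosage": "2", "frequency": "every 4 hours"};
    component values would need normalizing before comparison (omitted).
    """
    src_tags, cand_tags = tagger(source), tagger(candidate)
    return all(src_tags.get(field) == cand_tags.get(field)
               for field in ("dosage", "frequency", "duration"))
```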

3.3 Graceful back-off

If the simplified direction is deemed consistent after the numeric check, it is passed on as the final candidate for normalization. However, if the numeric check fails, the NMT output is discarded and the original source direction is used as the final candidate instead. This graceful back-off represents a trade-off between information accuracy and the fluency of the NMT output.
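In code, the back-off reduces to a one-line guard; this sketch reuses the `token_check` from the previous section as one possible consistency test.

```python
def graceful_backoff(source: str, nmt_output: str, is_consistent) -> str:
    """Keep the NMT output only when the numeric check passes; otherwise
    fall back to the original source direction (which is still normalized
    in the next stage)."""
    return nmt_output if is_consistent(source, nmt_output) else source

# e.g., final_candidate = graceful_backoff(src, hyp, token_check)
```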

3.4 Normalization

Before the candidate direction from the numerical check phase is finalized, it undergoes pharmacy-specific post-processing and simplification. Two pharmacists identified common linguistic patterns in pharmacy directions, which were coded into normalization rules. Highly-abbreviated medical jargon was expanded; for example, the Latin term bid was replaced with its synonymous phrase twice a day. Action verbs appropriate for the form of the drug, such as inject (syringe), inhale (nebulizer), and take (capsule), were added. Numerical values written as words or fractions were converted to digits (e.g., 1 1/2 was converted to 1.5). Abbreviations and other common medical expressions were normalized into standard, patient-friendly variants, e.g., inj or injector to injection. Overall, the normalization step included more than 300 rules. Table 2 shows examples of these normalization and simplification rules, and a code sketch of their application follows the table. The normalization module also served as the rule-based baseline.

Fields      Original            Normalized
Action      (missing)           take / inject / inhale ...
Dosage      one and half        1.5
            1 1/2               1.5
            one (1)             1
Form        tab, tabs           tablet
            cap                 capsule
            in, inj, injctor    injection
Route       orally, by oral     by mouth
            sq, subcutaneous    under the skin
Frequency   qd                  every day
            bid                 twice a day
Duration    x3 week             for 3 weeks
Table 2: Sample normalization and simplification rules
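As a sketch of how rules like those in Table 2 could be coded, the snippet below applies a small illustrative subset as ordered regular-expression replacements. The production system has over 300 rules, including the form-dependent action verbs, which are not reproduced here.

```python
import re

# A few rules from Table 2 as ordered (pattern, replacement) pairs.
RULES = [
    (r"\b1 1/2\b", "1.5"),
    (r"\btabs?\b", "tablet"),
    (r"\bpo\b|\borally\b", "by mouth"),
    (r"\bqd\b", "every day"),
    (r"\bbid\b", "twice a day"),
    (r"\bx(\d+) weeks?\b", r"for \1 weeks"),
]

def normalize(direction: str) -> str:
    text = direction.lower()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

print(normalize("1 tab po bid x3 week"))
# -> "1 tablet by mouth twice a day for 3 weeks"
```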

4 Experimental setup

4.1 Data set description

The e-prescription corpus used in this study consists of all e-prescriptions dispensed by an online outpatient mail-order pharmacy, exported from its dispensing software for the period from January 2017 to October 2018. The corpus contains a total of 530,988 e-prescriptions received from 65,139 unique physicians across all fifty US states (Pharmacy_2020).

Each e-prescription direction in the data set is paired with the corresponding simplified text authored by a mail-order pharmacy team member. Table 1 shows some example e-prescription directions and the corresponding simplified text. In addition to the direction, each record in the corpus also contains auxiliary information about the name and strength of the drug. However, similar to the directions, the physician-authored information often included chemical or ingredient names for the drug, while the pharmacist-translated version contained generic or brand names.

In all, there are 120,402 unique e-prescription directions and 83,823 unique pharmacist-authored directions in the dataset. The difference in these numbers is due to the diverse writing styles of physicians and pharmacists. On average, 6.33 e-prescription directions mapped to a single pharmacist-authored direction, while one e-prescription direction mapped, on average, to 4.41 different pharmacist-authored directions.

To avoid information leak during the evaluation, we split our data into train, validation, and test sets so that there were no duplicates across the sets. Duplicate e-prescription directions were grouped and assigned to only one of the three data sets. Table 3 summarizes the distribution of the instances over the three data sets. None of the instances in the validation and test sets were used during the training phase.

Data set Train Validation Test
Original 318,594 79,648 132,747
Deduplicated 318,594 15,625 36,652
Table 3: Data set sizes after removing duplicates.
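A sketch of this leakage-free split follows: duplicates are grouped by their source direction, and each group is assigned wholly to one set. The 60/15/25 proportions are inferred from the original counts in Table 3.

```python
import random
from collections import defaultdict

def split_without_leakage(pairs, seed=0):
    """Assign all duplicates of a source direction to exactly one split."""
    groups = defaultdict(list)
    for src, ref in pairs:
        groups[src].append((src, ref))
    keys = list(groups)
    random.Random(seed).shuffle(keys)
    cut1, cut2 = int(0.60 * len(keys)), int(0.75 * len(keys))
    expand = lambda ks: [pair for k in ks for pair in groups[k]]
    return expand(keys[:cut1]), expand(keys[cut1:cut2]), expand(keys[cut2:])
```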

4.2 Training process

To prepare the data for training, the source e-prescription directions and reference pharmacist-authored directions are prepended with the drug name and strength, as described in Section 3.1.1.

Pre-trained word embeddings:

We tested two static pre-trained word embeddings – general-domain GloVe embeddings and embeddings trained specifically over two clinical-domain corpora – against a randomly-initialized representation. The out-of-vocabulary (oov) rates are shown in Table 4. While only 10% of the source words and 6.69% of the reference words lacked embeddings in the clinical-domain model, the oov ratio was 7 to 9 times higher for the general-domain embeddings.

Model configuration:

The number of layers in the encoder and decoder stages of the BiLSTM/LSTM-based and Transformer-based models was empirically chosen from the set {2, 4, 6, 8}, and the length of the hidden states from the set {128, 256, 512}. In the following description, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A, consistent with the BERT notation (devlin2018bert). After choosing hyper-parameters based on the highest BLEU score, we primarily report results on the two best-performing architectures: the BiLSTM/LSTM-based model (L = 4, H = 256, Dropout = 0.4) and the Transformer-based model (L = 4, H = 128, A = 2, Dropout = 0.2). We trained the final models on a single Tesla V100 GPU. Training took 4.67 hours for the LSTM-based model with 12.66 million parameters, and 3.33 hours for the Transformer-based model with 9.27 million parameters.

Embeddings   Vocab. size   Source oov (n = 11,643)   Reference oov (n = 7,358)
Clinical     498,677       10%                        6.69%
General      400,000       73.75%                     60.29%
Table 4: Out-of-vocabulary (oov) ratios for the pre-trained word embedding models. General-domain word embeddings show a much higher oov ratio on our dataset.

4.3 Evaluation metrics

Automatic evaluation:

We evaluated the translation model using two automated candidate-reference comparison metrics: BLEU (papineni2002bleu) and METEOR (denkowski:lavie:meteor-wmt:2014). The BLEU-4 score is the most popular metric for evaluating the similarity between candidate text and a human reference in machine translation tasks (sutskever2014sequence; koehn2003statistical; lipton2015critical). It is computed as the geometric mean of the precision of unigram, bigram, trigram, and 4-gram matches between the candidate and reference texts. In contrast to the exact n-gram overlap used in BLEU (e.g., treat-treat), METEOR considers unigram alignment between candidate and reference texts. This results in more flexible matching, including stem matches (treats-treat) and synonym matches (sweet-treat) using WordNet (miller1995wordnet). METEOR scores are known to achieve a better correlation with human judgment (banerjee2005meteor).

Sample comparisons of the two metrics in Table 6 (top panel) show the limitation of using the BLEU score to compare pharmacy directions, while also showing the feasibility of the METEOR score as a viable alternative. The bottom panel of Table 6 highlights the limitation of both metrics in checking the consistency of prescription components, discussed in more detail in Sec. 6.4.
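The specific metric implementations used are not named here; as one way to reproduce such scores, the sketch below uses sacrebleu for sentence-level BLEU and NLTK's WordNet-backed METEOR on the first example from Table 6.

```python
import nltk
from nltk.translate.meteor_score import meteor_score
from sacrebleu.metrics import BLEU

nltk.download("wordnet", quiet=True)  # METEOR's synonym matching uses WordNet

candidate = "apply 1 drop into each eye at bedtime ."
reference = "instill 1 drop into both eyes at bedtime ."

bleu = BLEU(effective_order=True)  # sentence-level BLEU-4
print(bleu.sentence_score(candidate, [reference]).score)
print(meteor_score([reference.split()], candidate.split()))
```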

NMT Models                    BLEU         METEOR
TransF-TransF : Random        59.09±0.08   80.10±0.06
TransF-TransF : GloVe         62.36±0.10   80.45±0.05
TransF-TransF : Clinical      63.23±0.10   80.89±0.17
  + Ensemble                  64.19±0.31   81.05±0.12
BiLSTM-LSTM : Random          64.63±0.06   81.14±0.30
BiLSTM-LSTM : GloVe           65.78±0.07   81.62±0.11
BiLSTM-LSTM : Clinical        66.02±0.03   82.32±0.20
  + Ensemble                  66.61±0.29   82.73±0.10
Rule-based baseline           43.19        68.59
Table 5: Comparison of NMT models with different word embedding approaches, against a rule-based baseline. The ensemble BiLSTM-LSTM model with clinical word embeddings achieves the highest BLEU and METEOR scores.
Manual evaluation:

While automated metrics such as BLEU and METEOR are commonly used to evaluate machine translation tasks, they do not sufficiently evaluate the clinical usability of the candidate texts. Hence, in addition to the automated evaluation, we asked two pharmacist trainees to evaluate 300 pairs of e-prescription directions and corresponding simplified directions randomly sampled from the test set. The pharmacist trainees were asked to classify the direction pairs into one of three categories – Correct: all information in the simplified direction was correct; Missing: the simplified direction was correct, but missed some key information; and Wrong: the simplified direction contained key errors that needed to be corrected. Labeling disagreements were resolved by a pharmacist expert. This manual evaluation simulates the human effort undertaken in real-life at the pharmacies to simplify e-prescription directions.

5 Results

The results are presented as follows: first, we evaluate the NMT module by comparing the two proposed architectures using the automated BLEU and METEOR scores. After finalizing the best performing NMT model, we compare its results against a rule-based baseline and demonstrate the significance of the individual stages through ablation tests. Finally, we report the performance of the end-to-end system based on the manual evaluation.

E-prescription direction BLEU METEOR
location: both eyes. 1 drop into each eye at bedtime. [Source]
apply 1 drop into each eye at bedtime. [NMT] 0 0.77
instill 1 drop into both eyes at bedtime. [Reference]
1 puff once daily per dr. jones. [Source]
inhale the contents of one capsule by mouth once daily using handihaler. [NMT] 0.38 0.90
use 1 inhalation by mouth once daily. [Reference]
Evaluation Limitation
take 1 tablet by mouth every morning and every evening. [Reference]
take 1 tablet by mouth every morning & every evening. [NMT1] 0.70 0.91
take 10 tablets by mouth every morning and every evening. [NMT2] 0.70 0.98
Table 6: Top: Comparison of BLEU and METEOR scores on two examples. METEOR scores show higher correlation with human judgment due to flexible matching. Bottom: Both variations of NMT model output have similar BLEU and METEOR scores; however, the first merely replaced an ‘&’ with ‘and’, while the second had a critical dosage error (‘10’ instead of ‘1’).

5.1 Evaluating Neural Machine Translation module

We compared two classes of NMT models – Transformer-based and BiLSTM/LSTM-based – with different pre-trained word embeddings. The results are summarized in Table 5. The reported values are the mean and standard deviation over 10 independent iterations of training and validation using the same model hyper-parameters.

Both automated metrics were consistent in their ranking of the systems. On both metrics, LSTM-based models outperformed transformer-based models. One possible explanation for these results is that although transformer-based models are better at capturing long-range dependencies, their advantage is nullified by the relatively short sentences in this task. The average length of e-prescription directions is 10.42±4.55 tokens.

Models using the pre-trained clinical-domain word embeddings achieved the highest performance on both metrics and were statistically better than models using general-domain word embeddings. Models using randomly-initialized word representations performed the worst. Performance improved further when ensemble learning was applied: the ensemble BiLSTM/LSTM model with clinical domain-adaptive word embeddings achieved the highest overall BLEU score of 66.61±0.29 and METEOR score of 82.73±0.10.

In Sec. 3.1.2, we hypothesized that domain-adaptive word embeddings would outperform both general-domain embeddings and randomly-initialized vector representations; the results in Table 5 validate this hypothesis. Instead of the static, but domain-adaptive, word embeddings explored in this work, contextual word embeddings could also be used, including BioBERT (BioBERT) and ClinicalBERT (ClinicalBERT).

5.2 Comparison to rule-based baseline

Next, we compared the translated directions generated by the PharmMT model against a rule-based baseline in which the e-prescription is passed directly through the normalization module. Table 7 shows examples of rule-based translations alongside PharmMT output. These examples highlight potential issues with rule-based systems, including contextual ambiguity, re-ordering of direction components, and sensitivity to misspelled tokens; Table 7 also illustrates informal abbreviations and sense disambiguation. In the first example, the rule-based approach fails to recognize ‘90’ as a duration without contextual clues, whereas PharmMT correctly normalizes it as ‘for 90 days’. Similarly, in the second example, unlike the rule-based approach, the PharmMT model correctly re-orders the direction components, placing the dosage tokens (‘3.5 tablets’) before the rest of the direction. Rule-based approaches are also more sensitive to misspelled tokens than PharmMT, as shown in the third example.

# Issues E-prescription Rule-based Baseline PharmMT
1 Contextual ambiguity 1/2 tab bid orally 90. Take 0.5 tablet by mouth twice a day 90 Take 0.5 tablet by mouth twice a day for 90 days.
2 Re-ordering components tablets by mouth daily; 3.5 tab 7 mg. Take tablets by mouth daily ; 3.5 tablet 7 mg. Take 3.5 tablets by mouth daily .
3 Handling misspelled tokens one tablet by mouth oce daily . Take one tablet by mouth oce daily . Take 1 tablet by mouth once a day .
4 Informal abbreviations 1 puff aero pow br act bid. Inhale 1 puff aero pow br act twice a day. Inhale 1 puff by mouth twice a day .
5 Sense disambiguation spray 1 spray(s) 4 times a day by intranasal route as needed for 90 days . Use spray 1 spray 4 times a day in the nose route as needed for 90 days . Use 1 spray in the nose 4 times a day as needed .
Table 7: Examples comparing PharmMT model against a rule-based baseline, highlighting potential issues with rule-based approaches.
Model variations BLEU METEOR
(1): Rule-based Baseline (Normalizer) 43.19 68.59
(2): Best NMT model 67.21 82.90
(3): Best NMT model w/o Auxiliary 63.59 80.83
(4): (2) + Backoff 66.24 79.59
(5): PharmMT [same as (4) + (1)] 60.27 76.11
(6): (5) - Backoff [same as (2) + (1)] 60.29 76.20
Table 8: Ablation results showing model performance taking components out one-at-a-time, from the best NMT model to end-to-end system, PharmMT (in bold).

5.3 Results of the ablation study

The best-performing NMT model achieved a BLEU score of 67.21 and a METEOR score of 82.90. After finalizing the best NMT model, we assessed the contribution of the remaining components through an ablation study. The results are summarized in Table 8.

Removing auxiliary information:

Without augmenting auxiliary drug information, the performance of the NMT model drops by 5.4% on the BLEU score and 2.5% on the METEOR score. This shows that augmenting the directions with the auxiliary information about the name and strength of the drug helps improve the overall performance, as hypothesized in Section 3.1.1.

Graceful backoff and normalization:

Adding the NER-based numeric check, with graceful back-off when necessary, reduces the overall scores on the automated metrics (BLEU: 66.24, METEOR: 79.59). Adding normalization decreases both metrics further (BLEU: 60.27, METEOR: 76.11). This is because the normalization rules produce direction texts heavily influenced by the stylistic preferences of the pharmacists who coded them. So, while the resulting directions are more readable, patient-friendly, and preferred by pharmacists, they do not accurately reflect the style of the original reference corpus. We expand on this further in Section 6.3.

5.4 Manual evaluation of end-to-end system

The final output of the end-to-end system was evaluated by domain experts. Of the 300 pairs of e-prescription directions and their corresponding simplified texts, 86.7% (n=260) were labeled as Correct, 7.6% (n=23) as Missing, and 5.7% (n=17) as Wrong. Missing errors were primarily related to omitted adjectives and adverbs (e.g., transdermal, slowly) or typographic omissions, such as brackets. Wrong errors were primarily related to special directions (e.g., taking medications before meals or with food), formatting issues with dosage (e.g., 10-12), or complex directions based on days of the week (e.g., every day except Sundays). These results show that in 94.3% of instances, the simplified output can be used as-is or after minimal changes to add the missing elements.

6 Discussion

6.1 Error analysis

We further analyzed the instances labeled as ‘Wrong’ (n=17; 5.7%) in the manual evaluation. A major class of errors involved complicated instruction and language patterns in the e-prescription. These directions were, on average, 16.86 words long, compared to an average of 12.57 words for ‘Correct’ instances. For example, one direction that was not simplified correctly was: 30 units with meals plus ssi 150-200 2 units, 201-250 4 units, 251-300 6 units, 301-350 8 units, greater than 351 10 units. This direction instructed patients to change the dosage according to a sliding scale for insulin (ssi).

Based on the hypothesis that shorter directions will have simpler language patterns, we evaluated a subset of test instances (n=11,977; 32.6%) that were under 12 words long. This ‘shorter length’ subset achieved an aggregate BLEU score of 71.14, while the complementary ‘longer length’ subset managed an aggregate BLEU score of 64.02.
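This subset analysis can be reproduced with a simple length split; the sketch below assumes (source, candidate, reference) triples and scores each subset with sacrebleu's corpus BLEU.

```python
from sacrebleu.metrics import BLEU

def bleu_by_length(triples, max_len=12):
    """Score 'short' (< max_len source tokens) and 'long' subsets separately.

    `triples` holds (source, candidate, reference) strings; this mirrors the
    short-vs-long analysis above but is an illustrative reconstruction.
    """
    scores = {}
    short = [(c, r) for s, c, r in triples if len(s.split()) < max_len]
    long_ = [(c, r) for s, c, r in triples if len(s.split()) >= max_len]
    for name, subset in (("short", short), ("long", long_)):
        if subset:
            cands, refs = zip(*subset)
            scores[name] = BLEU().corpus_score(list(cands), [list(refs)]).score
    return scores
```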

Dosage Frequency Duration Combined
(n=22,099) (n=8,256) (n=2,879) (n=23,237)
NER 2,570 1,451 238 4,027
Token - - - 1,390
Table 9: Number of instances marked as inconsistent by the numeric checkers. The NER-based checker flagged more inconsistent instances than the token-based checker.

6.2 Numeric checker and Graceful back-off

We further investigated the performance of the numeric checkers. The number of instances marked as inconsistent by the two approaches is summarized in Table 9. The stricter, NER-based numeric checker flagged 4,027 instances (17.33%) as inconsistent, while the token-based checker flagged only 1,390 (5.98%).

6.3 Need for normalized reference

The normalization process is heavily influenced by the style preferred by the pharmacists coding the normalization rules. Since the preferred style of the team that generated the original reference differed from that of the expert pharmacists on our team, the original reference data were themselves not normalized. This led to a reduction in BLEU scores when normalization was added (cf. Table 8).

To understand the upper bound of our trained models, we created a normalized version of the reference corpus. Using this as the gold reference, the original reference corpus has a BLEU score of 82.48, the best NMT model scores 62.68, and the full PharmMT system reaches 71.14 (see Table 10). The table also shows the ratio of test instances modified by the normalization rules, indicating how close each output already is to the normalized form: the lower the ratio, the closer it is. The results indicate that the NMT model learned more latent rules from the training data than the hand-crafted normalization rules capture, while achieving a higher BLEU score.

Against normalized reference BLEU Ratio
Rule-based Baseline (Normalizer) 47.60 86.58%
Best NMT model 62.68 -
Best NMT model + Normalizer 71.33 30.81%
PharmMT 71.14 36.01%
Reference (upper bound) 82.48 42.36%
Table 10: Performance against normalized reference

6.4 Limitation of BLEU and METEOR

Automated metrics such as BLEU and METEOR evaluate translation results at a linguistic, token level and fail to capture nuanced semantic information. In prescription directions, however, different words carry unequal amounts of critical information and hence should be re-weighted during evaluation. As we noted in Sec. 3.2, consistency of key information is vital for patient safety. For example, in the translations shown at the bottom of Table 6, the two machine translation outputs, NMT1 and NMT2, each differ from the reference by only one token. NMT1 is labeled ‘Correct’ in the manual evaluation, while NMT2 is labeled ‘Wrong’ because of a serious dosage error; yet both receive similar BLEU and METEOR scores. In future work, we will focus on improving information consistency while maintaining high model performance.

7 Conclusion

We proposed and developed a machine translation-based approach, called PharmMT, to simplify e-prescription directions. We systematically evaluated the contribution of each stage and of the overall approach over a large mail-order pharmacy corpus. Our results showed that an ensemble model with a bi-directional LSTM encoder and an LSTM decoder, trained with clinical-domain word embeddings, performed best, and the end-to-end system achieved a BLEU score of 60.27. The NER-based numeric check and graceful back-off ensure information consistency, and the normalization stage helps generate patient-friendly directions. Qualitative evaluation by domain experts showed that 94.3% of the simplified directions could be used as-is or with minimal changes. These results indicate that the proposed approach could be deployed in practice to automate the simplification of prescription directions.