Volume 2 Supplement 3
A cascade of classifiers for extracting medication information from discharge summaries
© Halgrim et al; licensee BioMed Central Ltd. 2011
Published: 14 July 2011
Extracting medication information from clinical records has many potential applications, and recently published research, systems, and competitions reflect an interest therein. Much of the early extraction work involved rules and lexicons, but more recently machine learning has been applied to the task.
We present a hybrid system consisting of two parts. The first part, field detection, uses a cascade of statistical classifiers to identify medication-related named entities. The second part uses simple heuristics to link those entities into medication events.
The system achieved performance that is comparable to other approaches to the same task. This performance is further improved by adding features that reference external medication name lists.
This study demonstrates that our hybrid approach outperforms purely statistical or rule-based systems. The study also shows that a cascade of classifiers works better than a single classifier in extracting medication information. The system is available as is upon request from the first author.
Narrative clinical records store patient medical information, and extracting this information is an important problem with practical application. In this work we describe a system for extracting detailed medication information from hospital discharge summaries using a combination of rules and statistical learning.
Until recently, much of the work done on extracting medication information from clinical documents involved rules and lexicons. Gold et al. used a set of parsing rules formatted as regular expressions and a drug name lexicon, while Xu et al. filled a semantic representation model using lexicon lookups, regular expressions, and disambiguation rules. While convenient in the absence of a large corpus of annotated data, such rule-based systems can be time-consuming to build and difficult to manage. More recently, machine learning has been applied to the task: Patrick and Li used a conditional random fields (CRF) named entity identifier and a support vector machine (SVM) relationship classifier, Tikk and Solt also employed CRF to find named entities, and Li et al. worked with AdaBoost and CRF.
Maximum Entropy (MaxEnt) is a machine learning algorithm that, in the biomedical domain, has been used to identify personally identifiable information and assign gene function codes to genes. In information extraction, Chieu and Ng used it to extract succession management templates. As far as we know, MaxEnt has not been applied to medication information extraction in the clinical domain.
The problem task
Table 1. A sample discharge summary excerpt and the corresponding entries in the gold standard
Excerpt of Discharge Summary
55 the patient noted that he had a recurrence of this
56 vague chest discomfort as he was sitting and
57 talking to friends. He took a sublingual
58 Nitroglycerin without relief.
65 Flomax ( Tamsulosin ) 0.4 mg, po, qd,...
Corresponding Entries in the Gold Standard
m="Nitroglycerin" 58:0 58:0||do="nm"||mo="sublingual" 57:6 57:6||f="nm"||du="nm"||r="vague chest discomfort" 56:0 56:2||
m="flomax ( tamsulosin )" 65:0 65:3||do="0.4 mg" 65:4 65:5||mo="po" 65:6 65:6||f="qd" 65:7
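The gold standard entries above follow a regular pattern that can be read mechanically. The following is a minimal sketch, assuming that fields are separated by `||`, that each field is written `key="value"` optionally followed by start and end offsets of the form `line:token`, and that `nm` marks a field with no mention and therefore no offsets. The names `Field`, `FIELD_RE`, and `parse_entry` are hypothetical and are not part of the i2b2 tooling.

```python
import re
from typing import NamedTuple, Optional, Tuple

class Field(NamedTuple):
    key: str                           # e.g. "m", "do", "mo", "f", "du", "r"
    value: str                         # field text, or "nm"
    start: Optional[Tuple[int, int]]   # (line, token) or None when absent
    end: Optional[Tuple[int, int]]

# key="value" optionally followed by two line:token offsets.
FIELD_RE = re.compile(
    r'(?P<key>\w+)\s*=\s*"(?P<value>[^"]*)"'
    r'(?:\s+(?P<sl>\d+):(?P<st>\d+)\s+(?P<el>\d+):(?P<et>\d+))?'
)

def parse_entry(line: str) -> list:
    """Split one entry on '||' and parse each non-empty chunk into a Field."""
    fields = []
    for chunk in line.split("||"):
        m = FIELD_RE.search(chunk)
        if m is None:
            continue  # empty trailing chunk or unrecognized text
        has_span = m.group("sl") is not None
        start = (int(m.group("sl")), int(m.group("st"))) if has_span else None
        end = (int(m.group("el")), int(m.group("et"))) if has_span else None
        fields.append(Field(m.group("key"), m.group("value"), start, end))
    return fields

entry = ('m="Nitroglycerin" 58:0 58:0||do="nm"||mo="sublingual" 57:6 57:6'
         '||f="nm"||du="nm"||r="vague chest discomfort" 56:0 56:2||')
parsed = parse_entry(entry)
```

Here `parsed` holds six fields; the name field spans a single token on line 58, while the reason field spans tokens 0 through 2 on line 56.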
In this paper we present our approach to this task. A pre-processor generates section information via regular expressions and part-of-speech tags from the Stanford tagger. The next step is the system’s core: a cascade of statistical classifiers that identify medication fields. Simple rules then form entries from these fields.
The data for training and evaluating our methods came from the 2009 i2b2 challenge. The challenge organizers released 696 summaries for system development; a gold standard for entries was provided for 17 of them. The University of Sydney team annotated 145 of the 696 summaries and generously shared their annotations with i2b2 after the challenge for future research. We obtained and used 110 of those annotations as our training set and the remaining 35 as our development set. After the challenge, 251 more summaries were annotated by the challenge participants, and those summaries formed the final test set on which our system was evaluated.
Table 2. The data sets used in our experiments. Reported quantities: number of summaries, entries, and fields, with field counts broken down by name, dose, frequency, mode, duration, and reason.
We developed a hybrid system with three processing steps: (1) a pre-processing step, (2) a field detection step that identifies the six fields, and (3) a field linking step that links fields together to form entries. The second step is a statistical system, whereas the other two steps are rule-based. The second step was the main focus of this study. The entire system was first presented at the 2010 Louhi Workshop, after which the authors were invited to contribute to the special issue of this journal.
In addition to common processing steps such as part-of-speech (POS) tagging, our pre-processor includes a section segmenter that breaks discharge summaries into sections. Discharge summaries tend to consist of sections such as “ADMIT DIAGNOSIS”, “PAST MEDICAL HISTORY”, and “DISCHARGE MEDICATIONS”. Knowing section boundaries is important for the task because, according to the i2b2 challenge annotation guidelines for creating the gold standard, medications occurring under certain sections (e.g., “FAMILY HISTORY” and “ALLERGIES”) were to be excluded from the system output. Knowing the sections could also be useful for field detection and linking. For example, the “DISCHARGE MEDICATIONS” section is more likely to contain medications in a list than medications embedded in narrative text.
The set of sections and the exact spelling of section headings vary across discharge summaries. The section segmenter uses a regular expression (a line starting with a sequence of capitalized letters followed by a colon) to collect potential section headings from the training data. The headings whose frequencies are higher than a threshold are used to identify section boundaries in the discharge summaries.
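The heading collection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' segmenter: the regular expression allows spaces inside a heading (so that multi-word headings match), the threshold value is arbitrary, and the names `HEADING_RE`, `collect_headings`, and `segment` are hypothetical.

```python
import re
from collections import Counter

# A line starting with capital letters (spaces allowed, an assumption) and a colon.
HEADING_RE = re.compile(r"^([A-Z][A-Z ]*):")

def collect_headings(summaries, threshold=1):
    """Keep candidate headings whose frequency is higher than the threshold."""
    counts = Counter()
    for text in summaries:
        for line in text.splitlines():
            m = HEADING_RE.match(line)
            if m:
                counts[m.group(1).strip()] += 1
    return {h for h, c in counts.items() if c > threshold}

def segment(text, headings):
    """Split one summary into (heading, lines) pairs at known headings."""
    sections = [("PREAMBLE", [])]  # text before the first known heading
    for line in text.splitlines():
        m = HEADING_RE.match(line)
        if m and m.group(1).strip() in headings:
            rest = line[m.end():].strip()
            sections.append((m.group(1).strip(), [rest] if rest else []))
        else:
            sections[-1][1].append(line)
    return sections

docs = [
    "ALLERGIES: Penicillin\nDISCHARGE MEDICATIONS:\nFlomax 0.4 mg po qd\nNote: follow up",
    "ALLERGIES: none\nDISCHARGE MEDICATIONS:\nAspirin 81 mg po qd",
]
headings = collect_headings(docs, threshold=1)
sections = segment(docs[0], headings)
```

Because headings are learned from frequency counts, a one-off line such as "Note: follow up" is never treated as a section boundary, and rare all-capital lines fall below the threshold.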
The field detection step consists of three modules: find_name, which finds medication names; context_type, which determines whether each identified medication name appears in narrative text or in a list of medications; and find_others, which detects the five non-name field types. For all three modules we use the Maximum Entropy (MaxEnt) learner in the MALLET package because the training time for MaxEnt can be shorter than that of more sophisticated algorithms such as CRF. For find_name and find_others, we follow the common practice of treating named entity (NE) detection as a sequence labeling task with the Inside-Outside-Beginning (IOB) tagging scheme; that is, each token in the input is tagged with B-x (beginning an NE of type x), I-x (inside an NE of type x), or O (outside any NE).
The find_name module
As this module identifies medication names only, the tagset under the IOB scheme has three tags: B-m for beginning of a name, I-m for inside a name, and O for outside.
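To make the tagging scheme concrete, the sketch below converts annotated token spans into IOB tags. It assumes whitespace tokenization and inclusive token offsets, as in the gold standard example; the function name `iob_tags` is hypothetical.

```python
def iob_tags(tokens, spans):
    """Tag each token B-x, I-x, or O given (start, end_inclusive, type) spans."""
    tags = ["O"] * len(tokens)
    for start, end, ftype in spans:
        tags[start] = "B-" + ftype
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + ftype
    return tags

tokens = "Flomax ( Tamsulosin ) 0.4 mg, po, qd,".split()
# The medication name covers tokens 0 through 3.
tags = iob_tags(tokens, [(0, 3, "m")])
```

For this line the name tokens receive B-m, I-m, I-m, I-m and the remaining four tokens receive O, which is the three-tag view seen by find_name.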
Various features are used for this module, which we group into four types:
• (F1) includes word n-gram features (n=1,2,3). For instance, the bigram feature w_{i-1} w_i consists of the previous word and the current word.
• (F2) contains features of properties of the current word and its neighbors (e.g., their POS tags, affixes, lengths, containing section, and capitalization).
• (F3) checks the IOB tags of previous words.
• (F4) contains features that check whether an n-gram in the text appears as part of a medication name in some medication name lists.
For (F4) we used two medication name lists. The first list consists of medication names from the training data and is the only list used in set F4a. The second list includes drug names from the FDA National Drug Code Directory (http://www.accessdata.fda.gov/scripts/cder/ndc/) and is used to test whether features that check an external resource improve performance. Feature set F4b uses both lists.
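The (F1) and (F4) feature types can be illustrated with a small sketch. The feature names, the padding symbols, and the choice to lowercase tokens are assumptions made for the example; the actual feature templates are not specified here, and the functions `build_name_ngrams`, `ngram_features`, and `list_features` are hypothetical.

```python
def build_name_ngrams(names, max_n=3):
    """Collect every n-gram (n <= max_n) that occurs inside a listed name."""
    grams = set()
    for name in names:
        toks = name.lower().split()
        for n in range(1, max_n + 1):
            for s in range(len(toks) - n + 1):
                grams.add(tuple(toks[s:s + n]))
    return grams

def ngram_features(tokens, i):
    """(F1)-style word n-gram features around position i."""
    pad = ["<s>", "<s>"] + [t.lower() for t in tokens] + ["</s>", "</s>"]
    j = i + 2  # index of the current word in the padded sequence
    return {
        "w0=" + pad[j]: 1,
        "w-1_w0=" + " ".join(pad[j - 1:j + 1]): 1,
        "w0_w+1=" + " ".join(pad[j:j + 2]): 1,
        "w-2_w-1_w0=" + " ".join(pad[j - 2:j + 1]): 1,
        "w0_w+1_w+2=" + " ".join(pad[j:j + 3]): 1,
    }

def list_features(tokens, i, name_ngrams, max_n=3):
    """(F4)-style: does an n-gram covering position i occur in a listed name?"""
    low = [t.lower() for t in tokens]
    feats = {}
    for n in range(1, max_n + 1):
        for start in range(max(0, i - n + 1), min(i, len(low) - n) + 1):
            if tuple(low[start:start + n]) in name_ngrams:
                feats["in_list_%dgram" % n] = 1
    return feats

grams = build_name_ngrams(["flomax ( tamsulosin )", "nitroglycerin"])
tokens = ["He", "took", "a", "sublingual", "Nitroglycerin", "without", "relief."]
f1 = ngram_features(tokens, 4)
f4 = list_features(tokens, 4, grams)
```

Swapping the name list passed to `build_name_ngrams` (training names only, or training names plus an external directory) corresponds to the difference between the F4a and F4b settings.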
The context_type module
This module is a binary classifier that determines whether a medication name occurs in a list or narrative context. Features used by this module include the section name as identified by the pre-processing step, the number of commas and words on the line, the medication name itself and its position on the line, and nearby words.
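As a rough illustration of the inputs to this classifier, the sketch below builds a feature dictionary for one medication name on one line. The feature names and the one-word context window are assumptions, and `context_features` is a hypothetical helper rather than the authors' code.

```python
def context_features(line_tokens, name_start, name_end, section):
    """Features for deciding whether a name occurs in a list or in narrative."""
    name = " ".join(line_tokens[name_start:name_end + 1]).lower()
    feats = {
        "section=" + section: 1,
        "name=" + name: 1,
        "n_commas": sum(tok.count(",") for tok in line_tokens),
        "n_words": len(line_tokens),
        "name_pos": name_start,  # token position of the name on the line
    }
    # Nearby words: one token to the left and right of the name, if present.
    if name_start > 0:
        feats["prev=" + line_tokens[name_start - 1].lower()] = 1
    if name_end + 1 < len(line_tokens):
        feats["next=" + line_tokens[name_end + 1].lower()] = 1
    return feats

line = "Flomax ( Tamsulosin ) 0.4 mg, po, qd,".split()
feats = context_features(line, 0, 3, "DISCHARGE MEDICATIONS")
```

A line-initial name followed by several comma-separated items, as in this example, is the kind of evidence that would be expected to favor the "list" label.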
The find_others module
This module complements the find_name module and uses eleven IOB tags to identify five non-name fields. The feature set used in this module is similar to the one used in find_name, but some features in (F2) and (F4) are modified to suit the non-name fields. For instance, one feature that was not present in find_name checks whether a word fits a common pattern for dosage. In addition, some features in find_others look at the output of previous modules, like the location of nearby medication names, as this information can be provided by the find_name module at test time.
The final step, field linking, forms entries by associating each medication name with its related fields. Our current implementation uses simple heuristics. First, for each non-name field the closest prior and subsequent name fields are identified. Second, each non-name field is linked to one of those two name fields. In most cases, the non-name field is linked to the prior name field, but if the distance to the subsequent name field is shorter than the distance to the prior name field by more than two lines, we link the non-name field to the subsequent name field. Third, the (name, non-name) pairs are assembled into entries with a few rules that apply if more than one non-name field of the same type is linked to the same name field. More information about the modules, including the features and the linking rules, is available in Halgrim’s Master’s thesis, listed in the references.
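The choice between the prior and the subsequent name field can be sketched as below. Positions are (line, token) pairs and distance is measured in lines, following the description above. How a field with no prior (or no subsequent) name is handled is an assumption of this sketch, and `link_field` is a hypothetical name.

```python
def link_field(field_pos, name_positions, margin=2):
    """Return the position of the name field that a non-name field links to."""
    prior = max((p for p in name_positions if p <= field_pos), default=None)
    nxt = min((p for p in name_positions if p > field_pos), default=None)
    if prior is None:
        return nxt    # assumption: fall back to the only available name
    if nxt is None:
        return prior
    d_prior = field_pos[0] - prior[0]  # distance in lines to the prior name
    d_next = nxt[0] - field_pos[0]     # distance in lines to the next name
    # Prefer the prior name unless the next one is closer by more than `margin`.
    return nxt if d_prior - d_next > margin else prior

names = [(58, 0), (65, 0)]            # Nitroglycerin and Flomax in the excerpt
mode_link = link_field((57, 6), names)   # "sublingual"
dose_link = link_field((65, 4), names)   # "0.4 mg"
```

With the excerpt's offsets, "sublingual" links to the name on line 58 because no name precedes it, and "0.4 mg" links to the name earlier on its own line.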
In this section, we report our system’s performance on the development and test sets.
We use two sets of evaluation metrics: horizontal and vertical. Horizontal metrics measure performance at the entry level, whereas vertical metrics measure performance at the field level. Both metrics compare fields between the system output and the gold standard for an exact match. A field in the system output exactly matches a field in the gold standard if the two fields’ spans are identical and they have the same field type. The primary metric for the i2b2 challenge was horizontal F-score, which is the metric we use in this section unless otherwise specified.
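The exact-match criterion can be illustrated with a simplified field-level computation. This sketch represents each field as a (type, start, end) tuple and pools all fields together; it does not reproduce the official i2b2 scorer or the entry-level (horizontal) aggregation, and `exact_match_prf` is a hypothetical name.

```python
def exact_match_prf(system_fields, gold_fields):
    """Precision, recall, and F-score where a match needs equal type and span."""
    system, gold = set(system_fields), set(gold_fields)
    tp = len(system & gold)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

gold = [("m", (58, 0), (58, 0)), ("mo", (57, 6), (57, 6)), ("r", (56, 0), (56, 2))]
system = [
    ("m", (58, 0), (58, 0)),
    ("mo", (57, 6), (57, 6)),
    ("r", (56, 1), (56, 2)),   # boundary error: not an exact match
    ("do", (58, 1), (58, 1)),  # spurious field
]
p, r, f = exact_match_prf(system, gold)
```

The reason field with a shifted left boundary counts as both a false positive and a false negative, which shows why fields with variable boundaries are penalized heavily under exact matching.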
To determine whether the difference between two systems’ performances is statistically significant, we use approximate randomization tests. Given two systems that we would like to compare, we first calculate the difference between their horizontal F-scores. Then two pseudo-system outputs are generated by swapping (with probability 0.5) the two system outputs for each discharge summary. These new pseudo-outputs are scored as normal, and the difference between their F-scores is calculated. If the difference between the F-scores of these pseudo-outputs is no less than the original F-score difference, a counter, i, is increased by one. This process is repeated n=10,000 times, and the p-value is equal to (i+1)/(n+1). If the p-value is smaller than a predefined threshold (e.g., 0.05), we conclude that the difference between the two systems is statistically significant. A conservative statistical correction (Bonferroni) was used to adjust for multiple significance comparisons.
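The procedure above can be sketched in a few lines. Two choices here are assumptions: the test compares absolute differences (a two-sided reading), and each summary's output is reduced to a (true positives, system count, gold count) triple scored with a pooled F-score rather than the official horizontal metric. The names `micro_f` and `approx_randomization` are hypothetical.

```python
import random

def micro_f(outputs):
    """Pooled F-score over per-summary (tp, n_system, n_gold) triples."""
    tp = sum(o[0] for o in outputs)
    n_sys = sum(o[1] for o in outputs)
    n_gold = sum(o[2] for o in outputs)
    p = tp / n_sys if n_sys else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def approx_randomization(score_fn, out_a, out_b, n=10_000, seed=0):
    """Return the p-value (i + 1) / (n + 1) for the difference in scores."""
    rng = random.Random(seed)
    observed = abs(score_fn(out_a) - score_fn(out_b))
    i = 0
    for _ in range(n):
        pseudo_a, pseudo_b = [], []
        for a, b in zip(out_a, out_b):
            if rng.random() < 0.5:  # swap the two outputs for this summary
                a, b = b, a
            pseudo_a.append(a)
            pseudo_b.append(b)
        if abs(score_fn(pseudo_a) - score_fn(pseudo_b)) >= observed:
            i += 1
    return (i + 1) / (n + 1)

sys_a = [(8, 10, 10), (5, 6, 8), (9, 10, 12)]
sys_b = [(6, 10, 10), (4, 6, 8), (7, 10, 12)]
p_value = approx_randomization(micro_f, sys_a, sys_b, n=1000)
```

Under a Bonferroni correction with m comparisons, each p-value would be compared against the threshold divided by m rather than against the threshold itself.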
Performance of the field detection step
Table 3. The performance of field detection on the development set
When making the “narrative” vs. “list” distinction, the accuracy of context_type is 95.4%. In contrast, the accuracy of the baseline (which assigns a “list” context to each medication name) is only 55.6%.
Performance of the field linking step
Table 4. The performance of the field linking step on the development set
Effect of feature sets
To test the effect of feature sets on system performance, we trained the find_name and find_others modules with different feature sets. The models were trained on the training set and the system was tested on the development set.
Table 5. System performance on the development set with different feature sets
Results on the test data
Table 6. System performance on the test set
As mentioned, the results for “duration” and “reason” are the lowest of all fields, which was also the case for all the participating systems in the challenge. Those two fields are also the most difficult for humans to annotate, as indicated by their low inter-annotator agreement. One possible reason for these fields’ difficulty is that their content varies considerably more than that of “mode” and “frequency”. Another possibility is that, because they are longer and have more variability in their length than other fields, it is more difficult to locate their exact boundaries.
The results shown in Table 4 are intriguing. The linking rules appear to be adequate when given perfect input, but perform worse when operating on the imperfect input from the system’s field detection module. It is unclear how much of the drop in performance is due to the rules themselves and how much is due to the limiting factor of the imperfect fields. One way to explore this in future work would be a manual effort to construct the best possible set of entries given the system-defined fields and evaluate those entries against the gold standard.
Effect of training data size
The figure illustrates that, as the training data size increases, the horizontal F-score with both feature sets improves. In addition, the external list is most helpful when the training data size is small, as indicated by the decreasing gap between the two curves.
Cascade vs. find_all
Using three separate modules for field detection allows each one to use the features most appropriate for it. In addition, later modules can use features based on the output of previous modules. However, a potential downside is errors propagating through the cascade. An alternative is to use a single module to detect all six field types.
We built and tested such an alternative, which we call find_all. This module eliminates find_name and context_type. It finds medication names by adding two more class labels to find_others: B-m and I-m. Thus it is a 13-way MaxEnt classifier that can find all six field types in one pass through the text.
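The label counts for the three tagging configurations follow directly from the IOB scheme: two tags per field type plus the shared O tag. A minimal sketch, using the field abbreviations from the gold standard format (m, do, mo, f, du, r) and a hypothetical helper `iob_tagset`:

```python
def iob_tagset(field_types):
    """Build the IOB label set: O plus a B- and I- tag for each field type."""
    return ["O"] + [prefix + "-" + t for t in field_types for prefix in ("B", "I")]

NON_NAME = ["do", "mo", "f", "du", "r"]  # dose, mode, frequency, duration, reason

find_name_tags = iob_tagset(["m"])
find_others_tags = iob_tagset(NON_NAME)
find_all_tags = iob_tagset(["m"] + NON_NAME)
```

The three label sets have 3, 11, and 13 members respectively, matching the tag counts given for find_name, find_others, and find_all.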
Interestingly, when 10% of the training set is used for training, find_all has a higher F-score than the cascading approach, although the difference is not statistically significant at p≤0.05. As more data is used for training, the cascade outperforms find_all, and the difference between the two is statistically significant at p≤0.05 when at least 50% of the training data is used. One possible explanation for this phenomenon is that as more training data becomes available, the early modules in the cascade make fewer errors; as a result, the disadvantage of potential error propagation in the cascading approach is outweighed by the advantage that the later modules can use features that check the output of the earlier modules.
i2b2 challenge entrants as benchmark
Table 7. Benchmark performances of the top five i2b2 systems on the test set
A caveat of comparing Tables 6 and 7 is that time, availability of training data, and differences in available resources make it difficult to compare these systems to one another. First, as non-entrants in the challenge, we had more time to work on our system than the other systems cited here. Mork et al. report that their entry into the challenge used simple rules and lookup-lists due to time constraints. Second, there was a disparity in the amount of data used. While teams were allowed to annotate their own training set, only one team in the top five did: the University of Sydney team. This disparity in data may also explain why, of the top five performing systems, only one used any kind of machine learning. As the University of Sydney graciously shared their data, we were able to emphasize machine learning in our approach. In fact, both the Spasić et al. and Tikk and Solt teams reported that they implemented a rule-based system with lexicons because of the small amount of training data provided. Finally, teams were allowed to use any resource, including existing systems and lexicons unavailable to the general public. Doan et al. applied their existing rule-based medication extraction system to the problem and placed second in the challenge. These variations in resources made the challenge similar to the so-called open-track challenge in the general NLP field and complicate head-to-head comparisons.
We present a hybrid system for medication information extraction. It is built around a series of cascading MaxEnt classifiers for field detection. Its performance compares favorably to systems approaching the same task with rules and other machine learning algorithms. Incorporating additional resources as features improves performance. Given enough training data, the cascade system outperforms a single classifier that finds all fields at once. In the future, we plan to try to improve scores on the “duration” and “reason” fields by adding more specialized classifiers. We also plan to replace the rule-based linking module with a statistical linker to improve results.
This work was supported in part by US DOD grant N00244-091-0081 and NIH Grants 1K99LM010227-0110, 7R00LM010227-03, U54LM008748, and T15LM007442-06. We also thank the anonymous reviewers for helpful comments.
This article has been published as part of Journal of Biomedical Semantics Volume 2 Supplement 3, 2011: Proceedings of the Second Louhi Workshop on Text and Data Mining of Health Documents. The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/2/S3.
- Levin MA, Krol M, Doshi AM, Reich DL: Extraction and mapping of drug names from free text to a standardized nomenclature. AMIA Annual Symposium Proceedings: 10-14 November 2007; Chicago. 2007, 438-442.
- Gold S, Elhadad N, Zhu M, Cimino JJ, Hripcsak G: Extracting structured medication event information from discharge summaries. AMIA Annual Symposium Proceedings: 8-12 November 2008; Washington. 2008, 237-241.
- Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC: MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association. 2010, 17: 19-24. 10.1197/jamia.M3378.
- Taira RK, Soderland SG: A statistical natural language processor for medical reports. Proceedings of the AMIA Symposium: 6-8 November 1999; Washington. Edited by: Nancy M. Lorenzi. 1999, Hanley & Belfus, Inc, 970-974.
- Patrick J, Li M: High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge. Journal of the American Medical Informatics Association. 2010, 17: 524-527. 10.1136/jamia.2010.003939.
- Tikk D, Solt I: Improving textual medication extraction using combined conditional random fields and rule-based systems. Journal of the American Medical Informatics Association. 2010, 17: 540-544. 10.1136/jamia.2010.004119.
- Li Z, Liu F, Antieau L, Cao Y, Yu H: Lancet: a high precision medication event extraction system for clinical text. Journal of the American Medical Informatics Association. 2010, 17: 563-567. 10.1136/jamia.2010.004077.
- Taira RK, Bui AAT, Kangarloo H: Identification of patient name references within medical documents using semantic selectional restrictions. Proceedings of the AMIA Symposium: 9-13 November 2002; San Antonio. Edited by: Isaac S. Kohane. 2002, Hanley & Belfus, Inc, 757-761.
- Raychaudhuri S, Chang JT, Sutphin PD, Altman RB: Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Research. 2002, 12: 203-214. 10.1101/gr.199701.
- Chieu HL, Ng HT: A maximum entropy approach to information extraction from semi-structured and free text. Proceedings of the Eighteenth National Conference on Artificial Intelligence: 28 July – 1 August 2002; Edmonton. 2002, 786-791.
- Stanford Log-linear Part-Of-Speech Tagger. [http://nlp.stanford.edu/software/tagger.shtml]
- Uzuner Ö, Solti I, Cadag E: Extracting medication information from clinical text. Journal of the American Medical Informatics Association. 2010, 17: 514-518. 10.1136/jamia.2010.003947.
- Halgrim SR, Xia F, Solti I, Cadag E, Uzuner Ö: Extracting medication information from discharge summaries. Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents: 5 June 2010; Los Angeles. Edited by: Association for Computational Linguistics (ACL). 2010, 61-67.
- MAchine Learning for LanguagE Toolkit. [http://mallet.cs.umass.edu/]
- Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning 2001 (ICML-2001): 28 June – 1 July; Williamstown. Edited by: Carla E. Brodley and Andrea P. Danyluk. 2001, Morgan Kaufmann, 282-289.
- Halgrim SR: A pipeline machine learning approach to biomedical information extraction. Master’s thesis. 2009, University of Washington, Department of Linguistics.
- Noreen EW: Computer Intensive Methods for Testing Hypotheses: An Introduction. 1989, John Wiley & Sons.
- Uzuner Ö, Solti I, Xia F, Cadag E: Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association. 2010, 17: 519-523. 10.1136/jamia.2010.004200.
- Doan S, Bastarache L, Klimkowski S, Denny JC, Xu H: Integrating existing natural language processing tools for medication extraction from discharge summaries. Journal of the American Medical Informatics Association. 2010, 17: 528-531. 10.1136/jamia.2010.003855.
- Spasić I, Sarafraz F, Keane JA, Nenadić G: Medication information extraction with linguistic pattern matching and semantic rules. Journal of the American Medical Informatics Association. 2010, 17: 532-535. 10.1136/jamia.2010.003657.
- Mork JG, Bodenreider O, Demner-Fushman D, Doğan RI, Lang FM, Lu Z, Névéol A, Peters L, Shooshan SE, Aronson AR: Extracting Rx information from clinical narrative. Journal of the American Medical Informatics Association. 2010, 17: 536-539. 10.1136/jamia.2010.003970.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.