A cascade of classifiers for extracting medication information from discharge summaries

Background Extracting medication information from clinical records has many potential applications, and recently published research, systems, and competitions reflect an interest therein. Much of the early extraction work involved rules and lexicons, but more recently machine learning has been applied to the task. Methods We present a hybrid system consisting of two parts. The first part, field detection, uses a cascade of statistical classifiers to identify medication-related named entities. The second part uses simple heuristics to link those entities into medication events. Results The system achieved performance that is comparable to other approaches to the same task. This performance is further improved by adding features that reference external medication name lists. Conclusions This study demonstrates that our hybrid approach outperforms purely statistical or rule-based systems. The study also shows that a cascade of classifiers works better than a single classifier in extracting medication information. The system is available as is upon request from the first author.


Background
Narrative clinical records store patient medical information, and extracting this information is an important problem with practical application [1]. In this work we describe a system for extracting detailed medication information from hospital discharge summaries using a combination of rules and statistical learning.

Related work
Until recently, much of the work done on extracting medication information from clinical documents involved rules and lexicons. Gold et al. used a set of parsing rules formatted as regular expressions and a drug name lexicon [2], while Xu et al. filled a semantic representation model using lexicon lookups, regular expressions, and disambiguation rules [3]. While convenient in the absence of a large corpus of annotated data, such rule-based systems can be time-consuming to build and difficult to manage [4]. More recently, machine learning has been applied to the task: Patrick and Li used a conditional random fields (CRF) named entity identifier and a support vector machine (SVM) relationship classifier [5], Tikk and Solt also employed CRF to finding named entities [6], and Li et al. worked with AdaBoost and CRF [7].
Maximum Entropy (MaxEnt) is a machine learning algorithm that, in the biomedical domain, has been used to identify personally identifiable information [8] and assign gene function codes to genes [9]. In information extraction, Chieu and Ng used it to extract succession management templates [10]. As far as we know, MaxEnt has not been applied to medication information extraction in the clinical domain.

The problem task
For this work we are interested in the automatic extraction of information about medications that a patient takes. Specifically, we extract the following fields from hospital discharge summaries: names of medications (m), dosages (do), modes (mo), frequencies (f), durations (du), and reasons (r) for taking these medications. We refer to the medication field as the name field and the other five fields as the non-name fields. All non-name fields should be linked to exactly one name field in the system output. A name field and all the non-name fields that link to it form one or more entries, each of which corresponds to a medication event. An entry appears in either a list of medications ("list") or in narrative text ("narrative"). Table 1 shows an excerpt from a discharge summary and the corresponding entries in the gold standard. The first entry appears in narrative text, and the second in a medication list.
In this paper we present our approach to this task. A pre-processor generates section information via regular expressions and part-of-speech tags from the Stanford tagger [11]. The next step is the system's core: a cascade of statistical classifiers that identify medication fields. Simple rules then form entries from these fields.
i2b2 after the challenge for future research. We obtained and used 110 of those annotations as our training set and the remaining 35 as our development set. After the challenge, 251 more summaries were annotated by the challenge participants, and those summaries formed the final test set on which our system was evaluated.
The sizes of the data sets used in our experiments are shown in Table 2. The average number of entries and fields vary across the sets because the summaries in the test set were chosen randomly from a set of 547 held-out summaries, whereas the University of Sydney team chose to annotate the longest summaries in the released set.

Methods
We developed a hybrid system with three processing steps: (1) a pre-processing step, (2) a field detection step that identifies the six fields, and (3) a field linking step that links fields together to form entries. The second step is a statistical system, whereas the other two steps are rule-based. The second step was the main focus of this study. The entire system was first presented at the 2010 Louhi Workshop [13], where the authors were invited for the special issue of this journal.

Pre-processing
In addition to common processing steps such as part-of-speech (POS) tagging, our preprocessor includes a section segmenter that breaks discharge summaries into sections. Discharge summaries tend to consist of sections such as "ADMIT DIAGNOSIS", "PAST MEDICAL HISTORY", and "DISCHARGE MEDICATIONS". Knowing section boundaries is important for the task because, according to the i2b2 challenge annotation guidelines for creating the gold standard, medications occurring under certain sections (e.g., "FAMILY HISTORY" and "ALLERGIES") were to be excluded from the system output. Knowing the sections could also be useful for field detection and linking. For example, the 'DISCHARGE MEDICATIONS' section is more likely to contain medications in a list than medications embedded in narrative text.
The set of sections and the exact spelling of section headings vary across discharge summaries. The section segmenter uses a regular expression (a line starting with a sequence of capitalized letters followed by a colon) to collect potential section headings from the training data. The headings whose frequencies are higher than a threshold are used to identify section boundaries in the discharge summaries.

Field detection
This step consists of three modules: find_name, which finds medication names, context_type, which determines whether each identified medication name appears in narrative text or in a list of medications, and find_others, which detects the five The numbers in parentheses are the average numbers of entries or fields per discharge summary.
non-name field types. For all three modules we use the Maximum Entropy (MaxEnt) learner in the MALLET package [14] because the training time for MaxEnt can be shorter than more sophisticated algorithms such as CRF [15]. For find_name and find_others, we follow the common practice of treating named entity (NE) detection as a sequence labeling task with the Inside-Outside-Beginning (IOB) tagging scheme; that is, each token in the input is tagged with B-x (beginning an NE of type x), I-x (inside an NE of type x) and O (outside any NE).
The find_name module For (F4) we used two medication name lists. The first list consists of medication names from the training data and is the only list used in set F4a. The second list includes drug names from the FDA National Drug Code Directory (http://www.accessdata.fda.gov/scripts/cder/ndc/) and is used to test whether features that check an external resource improve performance. Feature set F4b uses both lists.

The context_type module
This module is a binary classifier that determines whether a medication name occurs in a list or narrative context. Features used by this module include the section name as identified by the pre-processing step, the number of commas and words on the line, the medication name itself and its position on the line, and nearby words.

The find_others module
This module complements the find_name module and uses eleven IOB tags to identify five non-name fields. The feature set used in this module is similar to the one used in find_name, but some features in (F2) and (F4) are modified to suit the non-name fields. For instance, one feature that was not present in find_name checks whether a word fits a common pattern for dosage. In addition, some features in find_others look at the output of previous modules, like the location of nearby medication names, as this information can be provided by the find_name module at test time.

Field linking
The final step is to form entries by associating each medication name with its related fields. Our current implementation uses simple heuristics. First, for each non-name field the closest prior and subsequent name fields are identified. Second, each nonname field is linked to one of those two name fields. In most cases, the non-name field is linked to the prior name field, but if the distance to the subsequent name field is shorter than the distance to the prior name field by more than two lines, we link the non-name field to the subsequent name field. Third, the (name, non-name) pairs are assembled into entries with a few rules that apply if more than one non-name field of the same type is linked to the same name field. More information about the modules, including the features and the linking rules, is available in [16].

Results
In this section, we report our system's performance on the development and test sets.

Evaluation metrics
We use two sets of evaluation metrics: horizontal and vertical. Horizontal metrics measure performance at the entry level, whereas vertical metrics measure performance at the field level. Both metrics compare fields between the system output and the gold standard for an exact match. A field in the system output exactly matches a field in the gold standard if the two fields' spans are identical and they have the same field type [12]. The primary metric for the i2b2 challenge was horizontal F-score, which is the metric we use in this section unless otherwise specified.

Statistical significance
To determine whether the difference between two systems' performances is statistically significant, we use approximate randomization tests [17]. Given two systems that we would like to compare, we first calculate the difference between horizontal F-scores. Then two pseudo-system outputs are generated by swapping (at 0.5 probability) the two system outputs for each discharge summary. These new pseudo-sets are scored as normal, and the difference between F-scores calculated. If the difference between F-scores of these pseudo-outputs is no less than the original F-score difference, a counter, i, is increased by one. This process is repeated n=10,000 times, and the p-value of the significance is equal to (i+1)/(n+1). If the p-value is smaller than a predefined threshold (e.g., 0.05), we conclude that the difference between the two systems is statistically significant. A conservative statistical correction (Bonferroni) was used to adjust for multiple significance comparisons. Table 3 shows the vertical precision, recall, and F-score on identifying the six field types in the development set, using all 110 training files and the F1-F4b feature sets. Table 3 shows that, while the system detects most fields well, it has trouble with "duration" and "reason," and particularly with the recall of those fields. When making the "narrative" vs. "list" distinction, the accuracy of context_type is 95.4%. In contrast, the accuracy of the baseline (which assigns a "list" context to each medication name) is only 55.6%.

Performance of the field linking step
In order to evaluate the field linking step, we generated a list of unique (name, nonname) pairs from the gold standard where the name and non-name fields appear in the same entry. We then compared the fields in these pairs with the ones produced by the field linking step for exact matches and calculated precision, recall, and F-score. Table 4 shows the results of two experiments: in the gold standard input experiment, the input to the field linking step is the fields from the gold standard, which allows us to evaluate the linker directly assuming the field detection step is perfect; in the system input experiment, the input is the actual output of our system's field detection step. Both experiments were performed on the development set. This table shows that our heuristics perform well when given perfect input, but perform considerably worse when given the imperfect fields as detected by the system as input.

Effect of feature sets
To test the effect of feature sets on system performance, we trained the find_name and find_others modules with different feature sets. The models were trained on the training set and the system was tested on the development set.
The results are in Table 5. For the last two rows, the F1-F4a row uses a medication name list derived from the training data and the F1-F4b row adds the FDA's National Drug Code Directory list. The F-score difference between all adjacent rows is statistically significant at p≤0.05, except for the pair F1-F3 vs. F1-F4a. It is not surprising that using the first medication name list on top of F1-F3 does not improve the performance, as the same kind of information has already been captured by F1. The improvement of F1-F4b over F1-F4a shows that the system can incorporate additional resources and achieve a statistically significant gain. Table 6 shows the system performance on the test data. This includes the horizontal precision, recall, and F-score, as well as the vertical metrics. The system was trained on the union of the training and development data. These results are good overall, and confirm our findings on the development set that the system has difficulty finding "duration" and "reason" fields. Despite the poor performance on these fields, the vertical "all fields" scores are still closer to those of the other four fields, reflecting the sparseness of the challenging fields in the data. Gold standard input: assuming perfect input from the field detection step; System input: using the actual output of the system's field detection step.

Field detection
As mentioned, the results for "duration" and "reason" are the lowest of all fields, which was also the case for all the participating systems in the challenge [12]. Those two fields are also the most difficult for humans to annotate, as indicated by their low inter-annotator agreement [18]. One possible reason for these fields' difficulty is that their content varies considerably more than that of "mode" and "frequency" [12]. Another possibility is that, because they are longer and have more variability in their length than other fields, it is more difficult to locate their exact boundaries [16].

Field linking
The results shown in Table 4 are intriguing. The linking rules appear to be adequate when given perfect input, but perform worse when operating on the imperfect input from the system's field detection module. It is unclear how much of the drop in performance is due to the rules themselves and how much is due to the limiting factor of the imperfect fields. One way to explore this in future work would be a manual effort to construct the best possible set of entries given the system-defined fields and evaluate those entries against the gold standard.
Effect of training data size Figure 1 shows the system performance on the development set when different portions of the training set are used for training. The curve with "+" signs represents the results for F1-F4b, and the curve with circles represents the results for F1-F4a. The figure illustrates that, as the training data size increases, the horizontal F-score with both feature sets improves. In addition, the external list is most helpful when the training data size is small, as indicated by the decreasing gap between the two curves. Horizontal precision, recall, and F-score demonstrates that including all proposed feature sets and the external data lists results in the best performance. Horizontal and vertical metrics for system trained on the union of the training and development sets using all features in sets F1-F4b.

Cascade vs. find_all
Using three separate modules for field detection allows each one to use the features most appropriate for it. In addition, later modules can use features based on the output of previous modules. However, a potential downside is errors propagating through the cascade. An alternative is to use a single module to detect all six field types. We built and tested such an alternative, which we call find_all. This module eliminates find_name and context_type. It finds medication names by adding two more class labels to find_others: B-m and I-m. Thus it is a 13-way MaxEnt classifier that can find all six field types in one pass through the text. Figure 2 compares the horizontal F-score of the system using the find_all algorithm as its field detection step with that of the system with cascading modules. Both use the F1-F4b feature sets except that, since find_others uses some features that check the output of previous modules which are not available to find_all, such as the look-ahead proximity of name fields, those features have been removed from find_all. Both algorithms are trained on the training set and evaluated on the development set.
Interestingly, when 10% of the training set is used for training, find_all has a higher F-score than the cascading approach, although the difference is not statistically significant at p≤0.05. As more data is used for training, the cascade outperforms find_all, and the difference between the two is statistically significant at p≤0.05 when at least 50% of the training data is used. One possible explanation for this phenomenon is that as more training data becomes available, the early modules in the cascade make fewer errors; as a result, the disadvantage of potential error propagation in the cascading approach is outweighed by the advantage that the later modules can use features that check the output of the earlier modules.

I2b2 challenge entrants as benchmark
Strictly for purposes of providing a benchmark, we report the horizontal precision, recall, and F-score on the test set of the top five systems [5] [19][20] [21][6] that participated in the 2009 i2b2 challenge [12] in Table 7. The table shows that the performance of our system is comparable to the top systems in the i2b2 challenge.
A caveat of comparing Tables 6 and 7 is that time, availability of training data, and differences in available resources make it difficult to compare these systems to one another. First, as non-entrants in the challenge, we had more time to work on our system than the other systems cited here. Mork et al. report that their entry into the challenge used simple rules and lookup-lists due to time constraints [21]. Second, there was a disparity in the amount of data used. While teams were allowed to  annotate their own training set, only one team in the top five did: the University of Sydney team [5]. This disparity in data may also explain why, of the top five performing systems, only one used any kind of machine learning. As the University of Sydney graciously shared their data, we were able to emphasize machine learning in our approach. In fact, both the Spasić et al. [20] and Tikk and Solt [6] teams reported that they implemented a rule-based system with lexicons because of the small amount of training data provided. Finally, teams were allowed to use any resource, including existing systems and lexicons unavailable to the general public. Doan et al. applied their existing rule-based medication extraction system to the problem and placed second in the challenge [19]. These variations in resources made the challenge similar to the so-called open-track challenge in the general NLP field and complicate headto-head comparisons.

Conclusions
We present a hybrid system for medication information extraction. It is built around a series of cascading MaxEnt classifiers for field detection. Its performance compares favorably to systems approaching the same task with rules and other machine learning algorithms. Incorporating additional resources as features improves performance. Given enough training data, the cascade system outperforms a single classifier that finds all fields at once. In the future, we plan to try to improve scores on the "duration" and "reason" fields by adding more specialized classifiers. We also plan to replace the rule-based linking module with a statistical linker to improve results.