Pooling annotated corpora for clinical concept extraction
© Wagholikar et al.; licensee BioMed Central Ltd. 2013
Received: 26 June 2012
Accepted: 2 January 2013
Published: 8 January 2013
Skip to main content
© Wagholikar et al.; licensee BioMed Central Ltd. 2013
Received: 26 June 2012
Accepted: 2 January 2013
Published: 8 January 2013
The availability of annotated corpora has facilitated the application of machine learning algorithms to concept extraction from clinical notes. However, high expenditure and labor are required for creating the annotations. A potential alternative is to reuse existing corpora from other institutions by pooling with local corpora, for training machine taggers. In this paper we have investigated the latter approach by pooling corpora from 2010 i2b2/VA NLP challenge and Mayo Clinic Rochester, to evaluate taggers for recognition of medical problems. The corpora were annotated for medical problems, but with different guidelines. The taggers were constructed using an existing tagging system MedTagger that consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction. We hope that our current work will be a useful case study for facilitating reuse of annotated corpora across institutions.
We found that pooling was effective when the size of the local corpus was small and after some of the guideline differences were reconciled. The benefits of pooling, however, diminished as more locally annotated documents were included in the training data. We examined the annotation guidelines to identify factors that determine the effect of pooling.
The effectiveness of pooling corpora, is dependent on several factors, which include compatibility of annotation guidelines, distribution of report types and size of local and foreign corpora. Simple methods to rectify some of the guideline differences can facilitate pooling. Our findings need to be confirmed with further studies on different corpora. To facilitate the pooling and reuse of annotated corpora, we suggest that – i) the NLP community should develop a standard annotation guideline that addresses the potential areas of guideline differences that are partly identified in this paper; ii) corpora should be annotated with a two-pass method that focuses first on concept recognition, followed by normalization to existing ontologies; and iii) metadata such as type of the report should be created during the annotation process.
Development of Natural Language Processing (NLP) tools generally requires a set of annotated documents in the application domain . The annotations serve as a reference for constructing rule-based NLP systems and as a training corpus to derive machine learning models for concept extraction. However, in the clinical domain, annotated corpora are often difficult to develop due to high cost of manual annotation involving domain experts and medical practitioners, and also due to concerns for patient confidentiality . Due to high demand, such corpora have been recently created with pioneering effort of some research groups and made available to the scientific community to support studies in clinical NLP [3–5].
Availability of the annotated corpora has fostered the application of machine learning algorithms to concept extraction from clinical notes [6, 7]. Supervised machine learning taggers that achieve an accuracy of more than 80% have been developed [8, 9], given their great success for general English text  and biomedical literature [11–13]. These taggers were developed as an alternative to earlier systems that use dictionaries and rules, e.g. MetaMap , MedLEE , and SymText/MPLUS . However, machine learning methods are sensitive to the distribution of data, such as the distribution of words in the vocabulary and grammar styles, which could significantly affect the portability of a trained machine learning system across institutions and, thus the value of annotated corpora.
Given the barriers for preparing a large annotated corpus in individual institutions, consolidation of annotation efforts has the potential to advance clinical NLP. One way of leveraging existing efforts is to pool annotated corpora across institutions. Pooling of the annotations to train machine learning taggers may increase performance of the taggers . However there has been little research on associated issues. In this paper we have investigated whether pooling of similar corpora from two different sources can improve performance of resultant machine learning taggers for medical problem detection. We hope that our study will be a useful guide for facilitating reuse of annotated corpora across institutions.
There have been similar efforts to pool corpora in the biomedical domain. Johnson et al.  semi-automatically changed the format of the Protein Design Group corpus into two new formats (WordFreak and embedded XML), without altering the semantics, to increase the usage of the corpus. The process of altering the format without change in the semantics was called ‘re-factoring’. Ohta et al.  extended the annotation of their GENIA corpus to integrate the annotation style of the GENTAG corpus, which is the other prominent and widely used biomedical corpus, so that their corpus can been pooled with others following the same format. As an extension of this work, Wang et al.  pooled these corpora (and a third known as AIMED)  hoping to achieve better performance using the large corpus. However, the performance dropped by 12%. Subsequently they analyzed incompatibilities among the corpora. After removing the incompatibilities, they obtained promising performance using the pooled corpora . Recently, Huang et al.  have reported significant performance improvement of machine learning based part-of-speech (POS) taggers, by training them on pooled dataset.
We have used publicly available resources, such as the UMLS MetaThesaurus as a term dictionary, GENIA tagger for POS tagging, programs in Mallet machine learning software suite for sequence tagging and clinical corpora from the i2b2/VA challenge, to investigate the feasibility of pooling annotations for clinical concept extraction. Our current effort can potentially benefit research on development of clinical taggers at healthcare institutions, by facilitating use of annotated corpora from other institutions. In the next subsection, we briefly explain the process of annotation for readers who are new to this field.
Machine learning based taggers for detecting phrases that convey particular concepts, requires the availability of reports that have been manually marked (annotated) for the phrases. For instance, in the sentence “The patient denies any abdominal pain”, the phrase denoting a medical problem has been marked by the underline. The exercise to manually create such a set of marked reports is initiated with the development of a guideline, which defines what to mark and also how to mark. A group of human annotators then independently follow the guideline to carry out the annotation exercise.
Researchers developing a machine learning tagger at an institute have the option of training the tagger on i) the in-house set of reports that have been manually annotated, ii) reports annotated at another institution or iii) a pooled set constructed by combining i and ii. While the development of the in-house corpus requires several hundred hours of human effort and the associated expenses, the corpus from other institutions may not be portable. In this paper, we have examined the factors associated with the use of corpus from other institutions.
We trained and tested taggers on a corpus from Mayo Clinic Rochester  and a corpus from the 2010 i2b2/VA NLP challenge , and examined the effect of pooling the corpora . These corpora share the property that they were annotated for the same task of developing taggers for detecting medical problems. However the corpora were constructed with different annotation guidelines. The experiments were carried out using an existing machine learning-based tagging system, MedTagger , that participated in the 2010 i2b2/VA NLP challenge. In an earlier study, we had reported performance gain for machine learning taggers by pooling corpora across institutions and report types . The corpora used in that study were subsets of the i2b2 corpus and were annotated with the same annotation guideline.
Taggers performed the best when the training and test sets were from the same corpus. F1-scores for MCR and i2b2 cross-validations were 0.58 and 0.79 for ‘exact span’ evaluation, respectively. The higher F1-scores for i2b2 as compared to MCR could be due to a lower diversity of report types and annotations. Addition of more reports to MCR corpus might lead to improved performance for MCR. For the ‘overlap spans’, the F1-scores were 0.82 for MCR and 0.89 for i2b2. The performance patterns were similar for recall and precision for the exact and overlap span evaluations.
Performance of the taggers was poor when they were trained exclusively on reports from the other corpora. For tagger trained on i2b2 and tested on MCR, the F1 score was 0.38 for exact spans. Similarly for tagger trained on MCR and tested on i2b2, the scores was 0.40.
Supplementation of the training set with reports from other corpora decreased the performance, by 12% points for MCR and 4% points for i2b2. The greater degradation for MCR is likely due to small size of the corpus as compared to the i2b2 corpus, i.e. the proportion of data supplemented to MCR training set was much larger than that for i2b2. The pattern was similar for precision and recall, and for the ‘overlap span’ evaluation. An exception was the improvement in the recall on MCR corpus when it was supplemented with i2b2 corpus, using ‘overlap span’ evaluation.
Results suggest that the corpora are incompatible for simple pooling. In an earlier study we had reported performance gain for machine learning taggers by pooling corpora across institutions and report types . The corpora used in that study were subsets of the i2b2 corpus and were annotated with the same guideline. Contrastingly, in the current study, the corpora differ in their annotation guidelines, and also have different distributions of report types. Also the corpora sizes are different from the ones examined in the earlier study. Hence we investigated the effect of differences in the annotation guidelines, distributions of report types and corpora sizes, on the performance of taggers trained on pooled corpora, as described in the following sub-sections.
We examined the annotation guidelines to identify factors that contributed to the performance degradation of taggers trained on pooled corpora.
Annotation guidelines for the two corpora differed slightly in definition of concepts. i2b2 annotation guideline extends definition of medical problem beyond the semantic type of signs/symptoms and disease/syndrome (disorder), to include pathologic functions, mental dysfunction, molecular dysfunction, congenital abnormality, acquired abnormality, neoplastic process, and virus/bacterium. It also allows the annotator to mark medical problems not covered in the UMLS.
MCR annotation guideline defined signs/symptoms and disorders, which we mapped to the problem class. Signs/symptoms were defined as concepts that mapped to SNOMED-CT subset of semantic type signs/symptoms. Disease/syndrome had a looser definition that extended beyond the semantic types for ‘i2b2 medical problem’, to include injury or poisoning, behavioral dysfunction, cell dysfunction, experimental model of disease and anatomical abnormality but excluded virus/bacterium.
Frequency of i2b2 annotations that begin with an article
i2b2 annotations included possessive pronouns, e.g. “his cancer”, while MCR annotations did not. 3% of i2b2 annotations began with ‘his’ (174) or ‘her’ (200).
MCR annotations included concepts that were not pertaining to patients. For instance, disease names in organizational unit names, e.g. “cancer” in “cancer department”. Concepts that are not directly related to the patient were not annotated in i2b2 corpus.
The i2b2 guidelines specify that one prepositional phrase following a concept can be included in the annotation if it does not contain a markable concept and/either indicates an organ/body part. Also a preposition can be included in a concept phrase if words therein can be rearranged to express the same concept without the preposition. For example, ‘pain in chest’ is a viable concept phrase according to the i2b2 guidelines schema because it indicates body part and can be rephrased without ‘in’ as ‘chest pain’. In contrast, the text segment ‘removal of mass’ is annotated as two concept phrases as it cannot be rearranged to express the same concept without the preposition. MCR guidelines did not explicitly address this issue.
The i2b2 guidelines specify that conjunctions that denote lists are included if they occur within the modifiers or are connected by a common set of modifiers. If the portions of the lists are otherwise independent, they should not be included. For example, the text segment ‘metastases in the liver and pancreas’ is a valid concept phrase including ‘and’, while the segment ‘diarrhea, nausea and vomiting’ is not valid. The latter is annotated as three concept phrases. MCR guidelines did not explicitly address this issue.
Overlap of annotations in the corpora
Ignoring one start word
Ignoring one word
The annotations in MCR corpus were mapped to the highest level of granularity-- to the UMLS CUIs. The MCR annotations were restricted strictly to the UMLS. Consequently, these annotations can be expected to inherit the limitations of UMLS which includes lack of concept coverage. This would possibly be the reason why the taggers trained on MCR corpus and tested on i2b2 corpus showed greater degradation in performance than vice-versa. The annotation guideline for the i2b2 corpus advocated a more intuitive approach for annotation. I2b2 annotators used UMLS definitions to guide the annotation, which allowed them greater flexibility to annotate and even include phrases that were not covered in the UMLS.
Hence, to facilitate reuse of the annotations for developing machine learning models for concept recognition, we suggest the following two-step approach for annotation. Annotators should first mark the phrases that correspond to the concept of interest (perform concept recognition), and then normalize the annotations by mapping each annotation to the set of ontology nodes with the highest possible granularity. When normalization is not possible to any ontology node, the phrase should be marked as a ‘novel’ concept.
For developing machine learning applications it is critical that all the phrases that map to the concept of interest are annotated, by ensuring that even those which are not covered in the reference ontology are marked up. The first pass of ‘concept recognition’ would ensure that all the concepts are covered. The second pass of ‘normalization’ will facilitate the filtering/sub-classing of annotations for developing machine learning taggers for a particular sub-class. This two-pass annotation method will facilitate the pooling of corpus with other similar corpora. Also the ‘novel’ annotation class will be useful for adding new ontology terms.
In addition to the differences in the annotation guidelines, the performance of the taggers could be affected by the distribution of report types in the corpora. i2b2 corpus included discharge summaries and progress notes, while MCR corpus had a wider variety, since the reports were randomly selected from the EMR system for annotation. Named entities may vary in their distributions on report types. For instance the history and examination reports will have a high density of patient symptoms as compared to the progress notes that will mainly refer to the symptoms addressed by the current treatment. Also the progress notes will perhaps contain more medical terminology instead of ordinary English words reported by the patient in the history and examination reports. The distribution of the medical terminologies on the report types may also depend on the institution, as many institutions have their own report formats. However the reports in either corpus did not have meta-data about the type of report. Hence, the authors could not investigate the ‘report-type’ factor further.
The performance of taggers trained and tested on MCR corpus increases in F1-score and recall as the corpus size increases. The increase is rapid at first with increment in the corpus size, but later forms a plateau. The precision, on the other hand, is nearly unaffected by the corpus size.
Pooling with the i2b2 corpus nearly always increases the recall. However as the precision always degrades on pooling, the F1-score first increases with pooling and then degrades. A possible explanation is that the smaller subsets of MCR corpus are deficient in all the annotation patterns, and when these are supplemented by the i2b2 curated corpus, there is an improvement in recall.
The improvement in recall for smaller sizes of MCR subsets is greater than the degradation in precision that occurs due to pooling, which translates to an improvement in F1-score. When the MCR subset size crosses a threshold, the improvement in recall cannot surpass the degradation in precision, which lowers the F1-score. This result leads to a hypothesis that pooling with a compatible corpus from another institution may be beneficial and an economically favorable alternative to extending the in-house annotated corpus, only when the in-house corpus is below a critical size. However our analysis is limited to a single observation and further studies on other corpora are needed to investigate the combined effect of differences in the annotation guidelines, distributions of report type and sizes.
The tagger performance with the curated i2b2 corpus was greater than with the original corpus for all subsets of MCR corpus. The simple approach of automated annotation, described earlier increased the threshold of the MCR subset size where the F1-score dips on pooling.
In summary, simple pooling of corpora was overall found to reduce the tagger performance. We examined the annotation guidelines for the corpora to delineate several inconsistencies that include concept definition, articles, possessive pronouns, unrelated concepts, prepositional phrases and conjunctions. Rectification of a subset of the annotation differences using an automatic approach reduced the performance degradation that occurred on pooling. The effect of distribution of report types could not be studied as the corpora were not annotated for report type. The effect of pooling was found to depend on the corpus size, as pooling was found to improve tagger performance for smaller subsets of the MCR corpus. This result suggests that pooling with corpora from another institution may be beneficial when the in-house corpus is below a critical size. Further studies on different corpora are needed to elucidate the relationship between the above mentioned factors and performance of taggers trained on pooled corpora. The investigation of these relationships would be a useful guide for researchers to develop machine learning taggers for clinical text.
We investigated whether pooling of corpora from two different sources, can improve performance and portability of resultant machine learning taggers. The effect of pooling local and foreign corpora, is dependent on several factors that include compatibility of annotation guidelines, distribution of report types and corpus size. Simple automatic methods to rectify some of the guideline differences can be useful to facilitate pooling. Practically useful machine taggers can be possibly developed by annotating a small local corpus, and pooling it with a large similar corpus, available for reuse from another institution. The benefits of pooling diminish as more local annotations are created. Our findings need to be confirmed with further studies using different corpora and machine taggers.
Studies on different corpora are needed to elucidate the relationship between the above mentioned factors and performance of taggers trained on pooled corpora. We plan to investigate whether weighting of features for the machine learning tagger and filtering of the annotations using a dictionary lookup, can improve the tagger performance on pooling the corpora. Another interesting direction of investigation would be to train machine taggers separately on local and foreign corpora and then to combine the taggers using machine learning.
We suggest that future initiatives for clinical annotations should consider guideline factors delineated in this paper for development of the annotation guidelines, so that their annotation effort can be utilized by others. Moreover the guidelines should also include instructions to annotate metadata about the reports, so that reports of the same type from different corpora can be readily pooled for enhancing machine learning based taggers. The authors are aware of an effort in this direction , but there needs to be consensus for wider utilization of the standard for annotation of new corpora. The annotation groups involved in clinical research should come together to develop a standard annotation guideline that facilitates reuse of annotation efforts. We also suggest a two-pass annotation method that focuses first on concept recognition, followed by normalization to existing ontologies, to facilitate the pooling of corpora.
Summary statistics for the corpora
% of tokens included in concept annotation
The first corpus consisted of 349 clinical reports from the ‘2010 i2b2/VA challenge on concept assertions and relations in clinical text’  that were provided to the participants for training. Partners Healthcare, Beth Israel Deaconess Medical Center and University of Pittsburgh Medical Center contributed discharge summaries for this corpus, and University of Pittsburgh Medical Center also contributed progress reports. This corpus was annotated for patient medical problems (signs/symptoms and disorders), treatments and tests.
The second corpus had 160 clinical notes from Mayo Clinic Rochester (MCR) [24, 28]. These were annotated for signs/symptoms, disorders, medications and procedures. The annotation class ‘problem’ from the first corpus was equivalent to the combination of classes -- signs/symptoms and disorders in the second corpus and we carried out experiments with reference to this class.
The experiments reported in this paper were carried out using an existing tagging system, MedTagger that participated in the 2010 i2b2/VA NLP challenge. This is an adaptation of the BioTagger-GM system that was originally developed to identify gene/protein names in biomedical literature . The pipeline of this system consisted of dictionary lookup, part of speech (POS) tagging and machine learning for named entity prediction and concept extraction.
We used the UMLS MetaThesaurus  and a collection of words used in a clinical vocabulary viewer  as the domain dictionary. The input text and dictionary were normalized to facilitate flexible matching. The normalization process included (a) converting words to base form using the UMLS SPECIALIST lexicon, (b) changing letters to lower case, (c) removing punctuation symbols, and (d) converting digits and Greek letters to 9 and ‘G’ respectively. The dictionary lookup tagged all phrase occurrences, including overlapping phrases.
We used GENIA tagger for labeling parts of speech to all input tokens. GENIA tagger  is based on maximum entropy models trained on biomedical text as well as generic English text.
Using the dictionary lookup and POS tagging results, we derived a set of features for each token. These were collated with other commonly used features for named entity recognition, such as words, word affixes and word shapes. The features were fed to a sequence tagger using a conditional random field (CRF) model  with a token window size of five.
Two kinds of evaluations were preformed: ‘exact span’ matching and ‘overlap span’ matching. In ‘exact span’ matching, the annotations were counted as matches if begin and end spans matched. In case of ‘overlap span’ evaluation the annotations were counted as matches, if there was any overlap in the span ranges. The overlap span is a more lenient measure of performance. It was expected to be more useful in this study, because a wide variation of phrases was anticipated between the corpora, as the corpora originated from different institutions and used different annotation guidelines.
Natural language processing
Electronic medical record
MAYO clinic Rochester
Conditional random field
Part of speech.
We are thankful to the people who developed and made available the machine learning tools: GENIA tagger and Mallet, terminology services: UMLS and the labeled corpora used in this study. The i2b2 corpus of deidentified clinical reports used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2 and SUNY. We are also thankful for the pioneering effort of the Mayo NLP team, for developing the resources which made the current study possible. The authors are thankful for the constructive comments of the reviewers.
This study was supported by National Science Foundation ABI:0845523, Strategic Health IT Advanced Research Projects (SHARP) award (90TR0002) to Mayo Clinic from Health and Human Services (HHS) and National Library of Medicine (NLM:1K99LM011389, 5R01LM009959-02) grants.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.