Assessment of NER solutions against the first and second CALBC Silver Standard Corpus
- Dietrich Rebholz-Schuhmann1,
- Antonio Jimeno Yepes1,
- Chen Li1,
- Senay Kafkas1,
- Ian Lewin1,
- Ning Kang2,
- Peter Corbett3,
- David Milward3,
- Ekaterina Buyko4,
- Elena Beisswanger4,
- Kerstin Hornbostel4,
- Alexandre Kouznetsov5,
- René Witte6,
- Jonas B Laurila5,
- Christopher JO Baker5,
- Cheng-Ju Kuo7,
- Simone Clematide8,
- Fabio Rinaldi8,
- Richárd Farkas9,
- György Móra9,
- Kazuo Hara10,
- Laura I Furlong11,
- Michael Rautschka11,
- Mariana Lara Neves12,
- Alberto Pascual-Montano12,
- Qi Wei13,
- Nigel Collier13,
- Md Faisal Mahbub Chowdhury14,
- Alberto Lavelli14,
- Rafael Berlanga15,
- Roser Morante16,
- Vincent Van Asch16,
- Walter Daelemans16,
- José Luís Marina17,
- Erik van Mulligen2,
- Jan Kors2 and
- Udo Hahn4
© Rebholz-Schuhmann et al; licensee BioMed Central Ltd. 2011
Published: 6 October 2011
Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions: the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions.
All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE than for entities categorised as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I.
The data sets from those participants who did not use the annotated SSC-I data for training purposes were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performances of the participants’ solutions were then measured again against the SSC-II. The annotation solutions again showed better results for DISO and SPE than for CHED and PRGE.
The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. The CPs’ annotation solutions perform better when benchmarked against the SSC-II than when benchmarked against the SSC-I.
Biomedical text mining (TM) has developed into a bioinformatics discipline, leading to the development of IT methods that deliver accurate results from automatic literature analysis to bioinformatics research. This research work requires the development of benchmark data sets containing annotations and thereafter the assessment of existing TM solutions against these corpora. A number of challenges have been proposed to achieve this goal: BioCreAtIvE I and II, JNLPBA and others [1–5]. In all these approaches, the organisers deliver a set of manually annotated documents and ask the challenge participants (CPs) to reproduce the results with their automatic methods. The annotated corpora are provided to the public after the challenge is closed, and all the results are documented and published in a scientific manuscript.
The first CALBC Challenge is similar in the sense that the project partners (PPs) of the CALBC project also provided an annotated corpus to the CPs, who were asked to reproduce the annotations by automatic means. On the other hand, the first CALBC Challenge differed from the aforementioned challenges in two respects: (1) the annotated corpus was generated automatically rather than manually (Silver Standard Corpus, SSC-I), and (2) the SSC-I is significantly larger than the corpora produced for the other challenges, i.e. the annotated corpus contains 50,000 Medline abstracts for training and the corpus for annotation consists of 100,000 test documents. This difference in size requires that all assessment is performed fully automatically, that the CPs apply annotation solutions that can cope with such a large-scale corpus, and that the assessment solutions can evaluate the contributions in a short period of time. The automatic annotation of the corpus also requires new solutions to integrate the contributions from different automatic annotation solutions into a single corpus. This process is called “harmonisation” and refers to methods that measure the agreement between the boundaries from different annotation solutions in order to select the entity boundaries that fulfil consensus criteria. Overall, these annotations should have the characteristic that all annotation solutions show high performance against the set of annotations, for example in terms of the F-measure of each annotation solution.
When comparing different NER solutions, it becomes clear that they do not generate the same results, since the results depend on their approach, their implementation, and the type of resources used for the instantiation of the solutions (see BioCreative II). On the other hand, combining the results from different automatic annotation solutions can improve the results of the combined solution (see BioCreative Meta-Server). As a consequence, the PPs of the CALBC project have combined their automatic annotation solutions to produce the first Silver Standard Corpus of the CALBC project.
In addition, each annotation solution is optimised for a single semantic type, and solutions for a larger scope of semantic groups are still missing. This is again partly due to the fact that manually curated corpora can only cover a small number of semantic groups, so that the work remains achievable within a fixed period of time and the available budget. The proposed approach of the CALBC project can cover a larger number of annotations because the annotations are produced automatically and harmonised by automatic means.
In this manuscript, we report on the results of the first CALBC challenge. The CPs have submitted one or several sets of annotated documents. All the submissions have been assessed against the SSC-I. In addition, the submissions have been used to generate the second Silver Standard Corpus (SSC-II), and all the submissions have been assessed against the SSC-II. The results are presented in this manuscript to support a better understanding of the extent to which the automatic generation of an annotated corpus contributes to the benchmarking of annotation solutions in a domain where a large number of named entities have to be identified in a large number of scientific documents.
In the CALBC project and challenge the PPs and CPs contribute their annotations on a given corpus to enable the harmonisation of all annotations for a large-scale annotated corpus. A priori we can assume that the annotation solutions do not share any properties and the contributed annotations should be produced by independent systems, but should be similar in the sense that they contribute annotations for entities in the biomedical domain. This leads to the result that the different solutions make use of similar biomedical data resources for the representation of terms and concepts and thus are expected to show similarities in the annotation.
Generation of the first CALBC Silver Standard Corpus (SSC-I)
All PPs annotated the corpus of 150,000 Medline abstracts with their annotation solutions. The project partners P01, P02 and P04 used dictionary-based concept recognition methods with techniques for quality improvement, whereas partner P03 applied a combination of solutions that are either dictionary-based or based on machine-learning techniques. All annotations were delivered in the IeXML format, and concept normalisation was expected to make use of standard resources such as UMLS, UniProtKb and EntrezGene, or to follow the UMLS semantic type system [9–13].
The alignment is based on the methods described in [6, 14]. The applied method used pair-wise comparisons of annotated sentences, considering all tokens and their order (called “alignment”), between the two sets from two different sources for a given semantic type. For every sentence, the annotations from one contribution for a given semantic type are aligned with the annotations from the next contribution for the same semantic type. The tokens are weighted with their inverse document frequency (IDF) across the whole corpus, and the cosine similarity of the two annotations is measured. If the similarity is above 0.98, the alignment is considered successful and the boundaries of the shorter annotation are selected as the final annotation (called “harmonisation step”). If the contributions from at least two partners agree on the same annotation (2-vote agreement), the annotation is selected for the final corpus. Only in the case of entities belonging to the category CHED did the PPs share the identical representation of a terminological resource for the annotation task, but this did not lead to a higher agreement on the annotations than for the other categories.
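The alignment and harmonisation step can be illustrated with a minimal sketch. It is not the CALBC implementation: the whitespace tokenisation, the smoothed IDF formula, the function names and the toy document frequencies are assumptions made for illustration, and the sketch treats each annotation as a bag of tokens, so it ignores the token order that the method in [6, 14] takes into account. Only the IDF weighting, the cosine similarity, the 0.98 threshold and the selection of the shorter annotation are taken from the description.

```python
import math
from collections import Counter

def idf_weights(doc_freq, n_docs):
    """Smoothed inverse document frequency per token (the smoothing is an assumption)."""
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def cosine(tokens_a, tokens_b, idf, default_idf=1.0):
    """Cosine similarity of two token lists, each token weighted by count * IDF."""
    va = {t: c * idf.get(t, default_idf) for t, c in Counter(tokens_a).items()}
    vb = {t: c * idf.get(t, default_idf) for t, c in Counter(tokens_b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def harmonise(ann_a, ann_b, idf, threshold=0.98):
    """Return the shorter annotation if the similarity is above the threshold, else None."""
    sim = cosine(ann_a.lower().split(), ann_b.lower().split(), idf)
    if sim <= threshold:
        return None  # the two annotations do not align
    return min(ann_a, ann_b, key=lambda a: len(a.split()))

# Toy document frequencies over a hypothetical 1,000-document corpus.
idf = idf_weights({"the": 1000, "human": 300, "insulin": 20, "receptor": 50}, n_docs=1000)

# A frequent token such as "the" carries little weight, so the pair aligns
# and the shorter boundary is kept.
print(harmonise("the human insulin receptor", "human insulin receptor", idf))

# A rare, informative token ("receptor") is missing on one side: no alignment.
print(harmonise("insulin", "insulin receptor", idf))
```

The IDF weighting is what makes the 0.98 threshold tolerant of differences in uninformative tokens while remaining strict about differences in rare, content-bearing tokens.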
Generation of the SSC-II
The contributions of the CPs were evaluated against the SSC-I. Different evaluation schemes were used to determine the performance of the solutions [6, 14]. All contributions were assessed against the SSC-I by applying exact matching, nested matching and cosine similarity matching with cosine similarity thresholds of 0.98 and 0.9 (results not shown). The measurements were performed on a set of 1,000 Medline abstracts selected at random from the full corpus.
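The difference between the matching schemes can be made concrete with a small sketch. The representation of annotations as (start, end) character offsets, the function names and the example spans are assumptions for illustration; in particular, nested matching is interpreted here as one span lying entirely within the other, which may differ in detail from the scheme used in [6, 14].

```python
def exact_match(a, b):
    """Both boundaries must be identical."""
    return a == b

def nested_match(a, b):
    """One (start, end) span lies entirely within the other."""
    return (a[0] >= b[0] and a[1] <= b[1]) or (b[0] >= a[0] and b[1] <= a[1])

def evaluate(predicted, reference, match):
    """Precision, recall and F-measure of predicted spans against reference spans."""
    hits = sum(1 for p in predicted if any(match(p, r) for r in reference))
    found = sum(1 for r in reference if any(match(p, r) for p in predicted))
    precision = hits / len(predicted) if predicted else 0.0
    recall = found / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical spans for one abstract: the second prediction has a shifted
# left boundary, the third has no counterpart in the reference.
reference = [(0, 7), (20, 42), (50, 58)]
predicted = [(0, 7), (26, 42), (60, 65)]

print(evaluate(predicted, reference, exact_match))
print(evaluate(predicted, reference, nested_match))
```

With these spans, exact matching credits only the first prediction, whereas nested matching also credits the second one, so the more lenient scheme yields the higher F-measure.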
The table gives an overview of the annotation solutions that have been used for the generation of the SSC-I and the SSC-II. For the generation of the SSC-I, only the annotations from the four project partners (P01 – P04) have been integrated, whereas the SSC-II combines the annotations from the challenge participants (P06 – P10, P13 and P15), not including P11, P12 and P14, since they used the training data. Please refer to the proceedings of the first CALBC workshop for further details.
[Table: annotation solutions of the PPs and CPs, including their use of the training data. Recoverable entries: dictionary-based concept recognition; different resources incl. UniProtKb, EntrezGene; MeSH, MedDRA, NCI, SNOMED-CT UMLS; NCI, MeSH, SNOMED-CT; indexing of tokens and terms; both trained & rule-based solutions; CRF based, trained NER solution.]
The alignments of the 100,000 documents were performed either on Sun Fire Opteron servers (4 or 8 CPUs, 32 to 256 GB RAM, 9-12 hours) or on a compute farm of 700 IBM compute engines (dual CPU, 1.2-2.8 GHz, 2 GB RAM, 3 hours).
Challenge participation and challenge contributions
The table shows the number of annotations that are contained in the SSC-I. This corpus has been generated from the contributions of the PPs. Not all challenge participants (CPs) participated in all parts of the challenge; a smaller number of CPs submitted annotations for chemical entities. In the submitted corpora, the average number of annotations was above the number in the SSC-I for CHED and PRGE, and below it for DISO and SPE.
[Table rows: Nr. of annotations in SSC-I; Nr. of CPs; Nr. of submissions from CPs; Average nr. of annotations from all CPs; Nr. of annotations in SSC-II.]
Five CPs focused on a single semantic group only. All the other CPs covered three or more semantic groups. CP P10 delivered a very high number of annotations for PRGE, which impaired the performance of the system against the SSC-I.
The PPs’ contributions have been aligned to generate the SSC-I. The SSC-I has been released to the public so that machine-learning based NER solutions can be trained on the corpus and so that the annotations of the CPs can be gathered for performance assessments. The contributions of the CPs have been used to generate the SSC-II.
Evaluation of the contributed annotated corpora against the SSC-I
The table shows the F-measure performance of the PPs and the CPs against the SSC-I (cos-98 harmonisation, 2-vote agreement). The project partners are part of the comparison (P01 – P04). P11, P12, and P14 used the training data for their annotations. Only the best performing submission of each CP was included in the analysis. P09 only contributed a small number of annotations in the submitted corpus.
Performance measures of the CPs’ solutions against the SSC-I
The performances of the annotation solutions against the SSC-I and the SSC-II.
The two best-performing machine-learning based solutions produce results that are comparable to known solutions for the gene mention task [16, 17]. On the other hand, the performances have been measured against a corpus that includes a higher degree of variability in the annotations than the gold standard corpora that are usually used for the measurement of gene-tagging solutions.
The diagram for the identification of the diseases (DISO, fig. 3) demonstrates that the majority of the proposed systems identified the diseases at a recall of 60% and above, and at a precision of 55% and above. Two rule-based solutions from CPs showed similar performances to the PPs’ solutions. We can conclude that the representation of the diseases in the SSC-I is better standardised and thus includes less variability or noise than the representation of proteins/genes and chemical entities.
The large majority of the proposed solutions achieved their best precision and recall values for the identification of species. Again, the two best performances were achieved by two machine-learning approaches that reproduced the annotations from the training data. The other solutions, i.e. the PPs’ solutions and the remaining CPs’ solutions, also performed better on species than on the other tasks. It is clear that species can be identified at a level of quality above the measured performances for the other semantic groups.
Performance against the SSC-II
F-measure performance of the contributions from the PPs and the challenge participants against the SSC-II (harmonisation: 98% cosine similarity, 3-vote agreement, 1,030 documents, see Material & Methods).
Tagging of proteins/genes and chemical entities measured against the SSC-II
The PPs’ annotation solutions for genes/proteins performed worse against the SSC-II than against the SSC-I (ref. to fig. 1). Since the SSC-II represents the harmonisation of annotations across a larger number of contributions, it can be expected that the annotations in the SSC-II are more heterogeneous than in the SSC-I.
The CPs’ annotation solutions performed better against the SSC-II than against the SSC-I: both precision and recall increased. This result shows that the SSC-II incorporates characteristic features that are shared amongst all annotation solutions.
In the SSC-II the annotation solutions of the PPs for chemical entities show lower performance in comparison to the SSC-I (refer to fig. 2). The performance of the CPs’ annotation solutions has improved. Altogether the distribution of the performances of the PPs’ annotation solutions and the CPs’ solutions is comparable.
The performances of the CPs’ annotation solutions have improved when moving from the SSC-I to the SSC-II. This result can be explained by the fact that the contributions of the CPs have been included in the SSC-II, but not in the SSC-I.
The results from the comparison of the annotation solutions for the chemical entities are not as clear as the results for the annotation of proteins/genes. In the case of the chemical entities, the performances of the PPs’ solutions deteriorate except for one PP. The performance of the CPs’ solutions varies to a small extent.
Tagging of diseases and species measured against the SSC-II
As in the assessment of disease annotations, the performances of the species tagging solutions of the PPs and the CPs did not vary when the annotations were evaluated against the SSC-II rather than the SSC-I. For both corpora, the annotation solutions yielded similar results. This leads to the conclusion that the SSC-I and the SSC-II contain similar annotations and that the different contributing systems had similar performances from the beginning. Overall, we can conclude that the representation of species is better normalised or standardised in the scientific literature than chemical entities or gene/protein representations.
The table shows the direct measurement of the SSC-I against the SSC-II that has been generated with the similarity measure of 98% cosine similarity scoring and a 3-vote agreement between the participants. The comparison is based on a 98% cosine similarity score.
Reference SSC-I (cos 0.98)
Direct measurement of the SSC-I against the SSC-II
In the direct comparison between the SSC-I and the SSC-II, the annotations for SPE and DISO show better agreement than the comparison of the annotations for PRGE and CHED. The latter shows the lowest performance indicating that higher diversity exists between the two corpora.
Discussion & conclusions
Manual inspection of the SSC-I and the SSC-II
The manual analysis of the SSC-I and the SSC-II is ongoing work. Due to the size of the corpus, special IT solutions are required to oversee the regularities and irregularities in the corpus. A selection of irregularities results from the methods applied. First, a number of annotations are not captured (“false negatives”, FN, reduced recall) if none of the solutions identifies the entities. An increasing number of contributing annotation solutions reduces the risk that annotations are missed: a larger number of included annotation solutions leads to a larger number of annotations that are captured. This gain is counterbalanced by the minimum number of agreements that is required to accept an annotation.
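The trade-off between the number of contributing solutions and the required number of agreements can be illustrated with a short sketch. The function name, the (mention, semantic group) representation and the example contributions are hypothetical, and the sketch assumes that the boundaries have already been harmonised, so that agreeing annotations are identical.

```python
from collections import Counter

def vote(contributions, min_votes):
    """Keep annotations proposed by at least `min_votes` distinct contributions."""
    counts = Counter(ann for contribution in contributions for ann in set(contribution))
    return {ann for ann, n in counts.items() if n >= min_votes}

# Hypothetical (mention, semantic group) annotations from four solutions.
contributions = [
    {("insulin", "PRGE"), ("diabetes", "DISO")},
    {("insulin", "PRGE"), ("diabetes", "DISO"), ("mouse", "SPE")},
    {("diabetes", "DISO"), ("mouse", "SPE")},
    {("diabetes", "DISO"), ("glucose", "CHED")},
]

# With two solutions and 2-vote agreement, "mouse" is missed.
print(sorted(vote(contributions[:2], min_votes=2)))

# With four solutions and 2-vote agreement, "mouse" is captured.
print(sorted(vote(contributions, min_votes=2)))

# Raising the threshold to 3 votes discards everything except "diabetes".
print(sorted(vote(contributions, min_votes=3)))
```

Adding contributions can only raise the vote count of an annotation, whereas raising the threshold can only shrink the accepted set, which is the counterbalance described above.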
Second, for the same type of entity, e.g. “insulin”, different annotation solutions use a different tag, e.g. PRGE instead of CHED and vice versa. The harmonisation of the corpus can account for this, but will not produce this type of polysemous annotation throughout the whole corpus, since not all mentions have been consistently annotated with the two different groups over the whole corpus.
Third, inflections of terms, e.g. “tumour” vs. “tumours” and “bear” vs. “bearing”, lead to disagreements between the different annotation solutions. In the first case, the inflectional variability could be resolved and would lead to higher agreement; in the second case, assumptions about the usage as a verb or a noun have to be made to resolve conflicts.
The comparison of the proposed solutions against the SSC-I is a new approach to evaluate annotation solutions. Until now, no large-scale corpus was available to achieve this task. In addition, it became clear that the SSC-I is homogeneous enough to be used as training data to achieve the same annotation task across the different semantic groups.
The generation of a harmonised corpus is a challenging task, but the presented results demonstrate that the produced harmonised corpus integrates the characteristics of the different annotation solutions. As a result, the features of the harmonised corpus can be determined through the annotation solutions that contribute to the generation of the SSC.
From a different perspective, we can argue that each of the used annotation solutions represents a piece of the complete annotation task. The more solutions are combined, the more closely we approximate an assumed consensus in the annotation task, which can be reproduced with a machine-learning tagging solution.
This work was funded by the EU Support Action grant 231727 under the 7th EU Framework Programme within Theme “Intelligent Content and Semantics” (ICT 2007.4.2). The research work in its first unrevised form was presented at the SMBM 2010, Hinxton, Cambridge, U.K. The work performed at IMIM (Laura Furlong) was funded by the EU Support Action grant 231727 under the 7th EU Framework Programme within Theme “Intelligent Content and Semantics” (ICT 2007.4.2) and the Instituto de Salud Carlos III FEDER (CP10/00524) grant. Fabio Rinaldi and Simon Clematide are supported by the Swiss National Science Foundation (grant 105315_130558/1).
This article has been published as part of Journal of Biomedical Semantics Volume 2 Supplement 5, 2011: Proceedings of the Fourth International Symposium on Semantic Mining in Biomedicine (SMBM). The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/2/S5.
- Hirschman L, Yeh A, Blaschke C, Valencia A: Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics. 2005, 6 (Suppl 1): S1-10.1186/1471-2105-6-S1-S1.
- Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: Overview of the Second BioCreAtIvE Community Challenge. Genome Biology. 2008, 9 (Suppl 2): S1-10.1186/gb-2008-9-s2-s1.
- Kim JD, Ohta T, Tsuruoka Y, Tateishi Y, Collier N: Introduction to the bio-entity recognition task at JNLPBA. Proceedings of the JNLPBA-04. 2004, Geneva, Switzerland, 70-75.
- Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J: Overview of BioNLP’09 Shared Task on Event Extraction. Proceedings of the Workshop on BioNLP: Shared Task. 2009, Colorado, USA, 1-9.
- LLL’05 challenge. [http://www.cs.york.ac.uk/aig/lll/lll05/]
- Rebholz-Schuhmann D, Jimeno Yepes AJ, van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Buyko E, Tomanek K, Beisswanger E, Hahn U: The CALBC Silver Standard Corpus for Biomedical Named Entities: A Study in Harmonizing the Contributions from Four Independent Named Entity Taggers. Proc. LREC 2010. 2010, ELRA, Valletta, Malta.
- Leitner F, Krallinger M, Rodriguez-Penagos C, Hakenberg J, Plake C, Kuo CJ, Hsu CN, Tsai RT, Hung HC, Lau WW, Johnson CA, Saetre R, Yoshida K, Chen YH, Kim S, Shin SY, Zhang BT, Baumgartner WA, Hunter L, Haddow B, Matthews M, Wang X, Ruch P, Ehrler F, Ozgür A, Erkan G, Radev DR, Krauthammer M, Luong T, Hoffmann R, Sander C, Valencia A: Introducing meta-services for biomedical information extraction. Genome Biol. 2008, 9 (Suppl 2): S6-10.1186/gb-2008-9-s2-s6.
- Proceedings of the First CALBC Workshop. [http://www.ebi.ac.uk/Rebholz-srv/CALBC/docs/FirstProceedings.pdf]
- Rebholz-Schuhmann D, Kirsch H, Nenadic G: IeXML: towards a framework for interoperability of text processing modules to improve annotation of semantic types in biomedical text. Proc. of BioLINK, ISMB 2006. 2006, Fortaleza, Brazil.
- Bodenreider O, McCray A: Exploring semantic groups through visual approaches. Journal of Biomedical Informatics. 2003, 36 (6): 414-432. 10.1016/j.jbi.2003.11.002.
- Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004, 32 (Database issue): D267-270.
- The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009, 37 (Database issue): D169-174.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007, 35 (Database issue): D26-31.
- Rebholz-Schuhmann D, Jimeno Yepes A, Van Mulligen E, Kang N, Kors J, Milward D, Corbett P, Buyko E, Beisswanger E, Hahn U: CALBC Silver Standard Corpus. J Bioinform Comput Biol. 2010, 8 (1): 163-79. 10.1142/S0219720010004562.
- Hettne KM, Stierum RH, Schuemie MJ, Hendriksen PJ, Schijvenaars BJ, van Mulligen EM, Kleinjans J, Kors JA: A dictionary to identify small molecules and drugs in free text. Bioinformatics. 2009, 25: 2983-91. 10.1093/bioinformatics/btp535.
- Leaman R, Gonzalez G: BANNER: An executable survey of advances in biomedical named entity recognition. Proceedings of the Pacific Symposium on Biocomputing. 2008, Hawaii, 13: 652-663.
- Torii M, Hu Z, Wu CH, Liu H: BioTagger-GM: a gene/protein name recognition system. J Am Med Inform Assoc. 2009, 16: 247-255. 10.1197/jamia.M2844.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.