Biomedical text mining (TM) has developed into a bioinformatics discipline, leading to the development of IT methods that deliver accurate results from automatic literature analysis to bioinformatics research. This research requires the development of benchmark data sets containing annotations and, thereafter, the assessment of existing TM solutions against these corpora. A number of challenges have been proposed to achieve this goal: BioCreative I and II, JNLPBA and others [1–5]. In all these approaches, the organisers deliver a set of manually annotated documents and ask the challenge participants (CPs) to reproduce the results with their automatic methods. The annotated corpora are made available to the public after the challenge has closed and all the results have been documented and published in a scientific manuscript.
The first CALBC Challenge is similar in that the project partners (PPs) of the CALBC project also provided an annotated corpus to the CPs, who were asked to reproduce the annotations by automatic means. On the other hand, the first CALBC Challenge differed from the aforementioned challenges in two respects: (1) the annotated corpus was generated automatically rather than manually (Silver Standard Corpus, SSC-I), and (2) the SSC-I is significantly larger than the corpora produced for the other challenges, i.e. the annotated corpus contains 50,000 Medline abstracts for training and the corpus for annotation consists of 100,000 test documents. This difference in size requires that all assessment is performed fully automatically, that the CPs apply annotation solutions that can cope with such a large-scale corpus, and that the assessment solutions can evaluate the contributions in a short period of time. The automatic annotation of the corpus also requires new solutions to integrate the contributions from different automatic annotation solutions into a single corpus. This process is called “harmonisation” and refers to methods that measure the agreement between the boundaries produced by different annotation solutions in order to select the entity boundaries that fulfil consensus criteria. Overall, the resulting annotations should have the characteristic that all annotation solutions show high performance against them, for example when measured by the F-measure of each annotation solution.
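The idea of harmonisation and of scoring a solution against the harmonised set can be illustrated with a minimal sketch. It is not the CALBC implementation: it assumes that agreement means an exact match of start offset, end offset and semantic type, and that the consensus criterion is a simple vote threshold. The names `harmonise`, `f_measure` and `min_votes`, the offsets and the type labels are all hypothetical.

```python
from collections import Counter

# An annotation is (start offset, end offset, semantic type).
# Offsets and type labels below are invented for illustration.
Span = tuple[int, int, str]


def harmonise(annotation_sets: list[set[Span]], min_votes: int) -> set[Span]:
    """Keep every span proposed by at least `min_votes` solutions.

    Assumption: two solutions agree only on an exact boundary and type match.
    """
    votes = Counter()
    for spans in annotation_sets:
        votes.update(spans)  # each solution contributes one vote per span
    return {span for span, count in votes.items() if count >= min_votes}


def f_measure(predicted: set[Span], reference: set[Span]) -> float:
    """Balanced F-measure of one solution against a reference set."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Three hypothetical annotation solutions applied to the same document.
system_a = {(0, 5, "TYPE_1"), (10, 18, "TYPE_2")}
system_b = {(0, 5, "TYPE_1"), (10, 20, "TYPE_2")}
system_c = {(0, 5, "TYPE_1"), (10, 18, "TYPE_2"), (25, 30, "TYPE_3")}

# Consensus criterion assumed here: at least two of three solutions agree.
consensus = harmonise([system_a, system_b, system_c], min_votes=2)
scores = {
    name: f_measure(spans, consensus)
    for name, spans in [("a", system_a), ("b", system_b), ("c", system_c)]
}
```

In this toy example the consensus set keeps the two spans proposed by at least two solutions. A solution that disagrees on one boundary, or that adds a span no other solution supports, scores lower against it. Relaxed boundary matching or a different vote threshold would change both the consensus set and the scores.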
A comparison of different named entity recognition (NER) solutions shows that their results differ depending on their approach, their implementation, and the type of resources used to instantiate them (see BioCreative II). On the other hand, combining the results from different automatic annotation solutions can improve on the results of the individual solutions (see BioCreative Meta-Server). As a consequence, the PPs of the CALBC project have combined their automatic annotation solutions to produce the first Silver Standard Corpus of the CALBC project.
In addition, each annotation solution is optimised for a single semantic type, and solutions that cover a larger range of semantic groups are still missing. This is again partly due to the fact that manually curated corpora can only cover a small number of semantic groups, so that the annotation work remains achievable within a fixed period of time and the available budget. The approach proposed by the CALBC project can cover a larger number of annotations because the annotations are produced automatically and harmonised by automatic means.
In this manuscript, we report on the results of the first CALBC challenge. The CPs submitted one or several sets of annotated documents, and all submissions were assessed against the SSC-I. In addition, the submissions were used to generate the second Silver Standard Corpus (SSC-II), and all submissions were assessed against the SSC-II as well. The results are presented here to support a better understanding of the extent to which the automatic generation of an annotated corpus contributes to the benchmarking of annotation solutions in a domain where a large number of named entities have to be identified in a large number of scientific documents.