Table 5 Corpora used for evaluation of biomedical semantic annotators. The table includes corpora that were used in the reported use cases (“Benefits and Use Cases” section, Table 2), and/or benchmarking of the discussed tools (“Summary of benchmarking results” and “Entity-specific biomedical annotation tools” sections).

From: Semantic annotation in biomedicine: the current landscape

AnEM - Anatomical Entity Mention [76]

The corpus consists of 500 documents selected randomly from citation abstracts and full-text biomedical research papers (from PubMed); it is manually annotated (over 3000 annotations) with anatomical entities. The corpus is available under the open CC-BY-SA license.

URL: http://www.nactem.ac.uk/anatomy/

BC4GO [77]

The corpus, developed for the BioCreative IV shared task, consists of 200 articles (over 5000 text passages) from Model Organism Databases; these articles were manually annotated with more than 1356 distinct GO terms. In addition to the core elements of GO annotations - a gene or gene product, a GO term, and a GO evidence code - the corpus also includes the GO evidence text.

URL: http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/

CALBC - Collaborative Annotation of a Large Biomedical Corpus [78]

A very large, publicly shared corpus of Medline abstracts automatically annotated with biomedical entities. The small corpus comprises ~175 K abstracts, whereas the big one consists of more than 714 K abstracts. Because the annotations were produced not by humans but by several annotation systems (and then aggregated), it is referred to as a “silver standard”.

URL: http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/resources.html

Chemical Disease Relation (CDR) [79]

The corpus, developed for the BioCreative V shared task, consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions. MeSH is used as the controlled vocabulary.

Like BC4GO, this corpus is available exclusively for scientific, educational, and/or non-commercial purposes.

URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/

CRAFT - the Colorado Richly Annotated Full Text corpus [80]

Publicly available, human-annotated (gold standard) corpus of full-text biomedical journal articles; it consists of 67 documents and 87,674 human annotations.

URL: http://bionlp-corpora.sourceforge.net/CRAFT/

GENETAG [81]

Publicly available corpus of 20 K Medline sentences manually annotated with gene/protein names. Part of the corpus (15 K sentences) was used for the BioCreative I challenge (Gene Mention Identification task), and the rest (5 K sentences) was used as test data for the BioCreative II competition (Gene Mention Tagging Task).

URL: https://github.com/openbiocorpora/genetag

An updated version of this corpus, named GENETAG-05, is part of a broader MedTag annotated corpus that was used in the BioCreative I challenge; it is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/

GENIA [82]

Open-access, manually annotated corpus consisting of 2000 Medline abstracts (400,000+ words) with almost 100,000 annotations for biological terms. Terms are annotated with concepts from the GENIA ontology, a formal model of cell signaling reactions in humans (the ontology is provided together with the corpus).

Available from the following repository: http://corpora.informatik.hu-berlin.de/

2010 i2b2/VA corpus [83]

The corpus consists of manually annotated, de-identified clinical records (discharge summaries and progress reports) from three medical centers. It was originally created for the 2010 i2b2/VA NLP challenge to support three kinds of tasks: extracting medical concepts from patient reports; assigning assertion types to medical problem concepts; and determining the type of relation between medical problems, tests, and treatments. The corpus consists of 394 annotated training reports, 477 annotated test reports, and 877 unannotated reports.

The corpus is made available to the research community from https://i2b2.org/NLP/DataSets under data use agreements.

JNLPBA [84]

A publicly available, manually annotated corpus originally created for the Bio-Entity Recognition Task at BioNLP/NLPBA 2004. The training set consists of 2000 Medline abstracts extracted from the GENIA Version 3.02 corpus; the data set is annotated with five entity types: Protein, DNA, RNA, Cell_line, and Cell_type. The test set consists of 404 annotated Medline abstracts, also from the GENIA project; half of this data set is from the same domain as the training data, whereas the other half is from the super-domain of blood cells and transcription factors.

URL: http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004

NCBI Disease corpus [85]

Publicly available, manually annotated corpus of 793 PubMed abstracts; 6892 disease mentions are annotated with concepts from Medical Subject Headings (MeSH) and Online Mendelian Inheritance in Man (OMIM) vocabularies.

URL: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/

Mantra Gold Standard Corpus [73]

Publicly available multilingual gold-standard corpus for biomedical concept recognition. It includes text from different types of parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. It contains 5530 annotations based on a subset of UMLS that covers a wide range of semantic groups.

URL: http://biosemantics.org/index.php/resources/mantra-gsc

ShARe - Shared Annotated Resources [86]

Gold standard corpus of de-identified clinical free-text notes; it includes 199 documents and 4211 human annotations. It was originally prepared for the ShARe/CLEF eHealth Evaluation Lab, which focused on NLP and information retrieval tasks for clinical care.

URL: https://sites.google.com/site/shareclefehealth/data