Table 1 Corpora used for evaluation

From: Generalising semantic category disambiguation with large lexical resources for fun and profit

Name Semantic categories
Epigenetics and Post-Translational Modifications corpus [35] (EPI) 17
Infectious Diseases corpus [22] (ID) 16
Genia Event corpus [36] (GE) 11
Collaborative Annotation of a Large Biomedical Corpus [37] (SSC) 4
BioNLP/NLPBA 2004 Shared Task corpus [38] (NLPBA) 5
Gene Regulation Event Corpus [39] (GREC) 64 (6)
Multi-Level Event Extraction corpus [21] (MLEE) 52
GeneReg corpus [40] (GReg) 10
Gene Expression Text Miner corpus [41] (GETM) 3
BioInfer [7] (BI) 119 (97)
BioText [42] (BT) 2
CoNLL-2002 Shared Task corpus, Spanish subset [20] (CES) 4
CoNLL-2002 Shared Task corpus, Dutch subset [20] (CNL) 4
i2b2 Medication Challenge corpus [19] (I2B2) 6
OSIRIS corpus [43] (OSIRIS) 2
  1. Parenthesised values give the number of categories remaining after pre-processing, which was applied either to avoid data sparseness (conversion of GREC into SGREC [3]) or to compensate for ontological design decisions (BI). The mid-line separates the corpora used in previous work [3], listed above it, from the corpora added to evaluate our approach across a variety of domains and a large set of semantic categories.