
Table 1 Corpora used for evaluation

From: Generalising semantic category disambiguation with large lexical resources for fun and profit

Name                                                                Semantic categories
Epigenetics and Post-Translational Modifications corpus [35] (EPI)  17
Infectious Diseases corpus [22] (ID)                                16
Genia Event corpus [36] (GE)                                        11
Collaborative Annotation of a Large Biomedical Corpus [37] (SSC)    4
BioNLP/NLPBA 2004 Shared Task corpus [38] (NLPBA)                   5
Gene Regulation Event Corpus [39] (GREC)                            64 (6)
Multi-Level Event Extraction corpus [21] (MLEE)                     52
GeneReg corpus [40] (GReg)                                          10
Gene Expression Text Miner corpus [41] (GETM)                       3
BioInfer [7] (BI)                                                   119 (97)
BioText [42] (BT)                                                   2
CoNLL-2002 Shared Task corpus, Spanish subset [20] (CES)            4
CoNLL-2002 Shared Task corpus, Dutch subset [20] (CNL)              4
i2b2 Medication Challenge corpus [19] (I2B2)                        6
OSIRIS corpus [43] (OSIRIS)                                         2

  1. Parenthesised values give the number of categories that remain after pre-processing, which was applied either to avoid data sparseness (conversion of GREC into SGREC [3]) or to compensate for ontological design decisions (BI). The mid-line separates the corpora above it, which were used in previous work [3], from the corpora added to evaluate our approach across a variety of domains and a large set of semantic categories.