Table 1 Presentation of the general characteristics of the corpora used in the experiments

From: tESA: a distributional measure for calculating semantic relatedness

                                             | MEDLINE              | PMC OA                                 | Wikipedia
Size                                         | 14073912             | 1024890                                | 3807314
Type                                         | Scientific           | Scientific                             | Encyclopedic
Documents                                    | Abstracts and titles | Mostly full text + abstracts + titles  | Full text + titles
Snapshot date                                | Autumn 2015          | September 2015                         | December 2015
Token count [M] (contents; titles)           | 2531.14; 264.84      | 3684.89; 15.8                          | 2434.55; 11.13
Unique token count [M] (contents; titles)    | 3.85; 1.24           | 35.57; 0.48                            | 12.53; 0.98
  1. Token counts and unique token counts are expressed in millions. These statistics are collected for raw texts (before preprocessing) and raw corpora (e.g. there might be an uneven number of titles and abstracts in MEDLINE). For each corpus and count type we provide two values: the first for the documents' textual contents (abstracts or full articles) and the second for the titles. The statistics are included to highlight the compositional differences between the corpora.
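
As an illustration of how the two counts in each cell relate, the following is a minimal sketch that computes the token count and the unique token count separately for document contents and titles. It is not the procedure used to produce the table: the whitespace tokenization, the case-sensitive comparison, the function name corpus_token_stats and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def corpus_token_stats(texts):
    """Return (token_count, unique_token_count) for an iterable of raw strings.

    Assumption: tokens are whitespace-separated and compared case-sensitively;
    the table's source does not specify a tokenizer.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return sum(counts.values()), len(counts)


# Hypothetical toy corpus standing in for raw, unpreprocessed documents.
docs = [
    {"title": "Semantic relatedness", "body": "We measure relatedness of terms ."},
    {"title": "Semantic similarity", "body": "We measure similarity of terms ."},
]

# One pair of statistics for the textual contents, one for the titles,
# mirroring the "contents; titles" layout of the table cells.
body_stats = corpus_token_stats(d["body"] for d in docs)
title_stats = corpus_token_stats(d["title"] for d in docs)

print("contents:", body_stats)
print("titles:  ", title_stats)
```

Dividing each value by 1,000,000 would give figures in the unit used by the table.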