Table 1 Presentation of the general characteristics of the corpora used in the experiments

From: tESA: a distributional measure for calculating semantic relatedness

 

| | MEDLINE | PMC OA | Wikipedia |
| --- | --- | --- | --- |
| Size | 14073912 | 1024890 | 3807314 |
| Type | Scientific | Scientific | Encyclopedic |
| Documents | Abstracts and titles | Mostly fulltext + abstracts + titles | Fulltext + titles |
| Snapshot date | Autumn 2015 | September 2015 | December 2015 |
| Token count [M] (content; titles) | 2531.14; 264.84 | 3684.89; 15.8 | 2434.55; 11.13 |
| Unique token count [M] (content; titles) | 3.85; 1.24 | 35.57; 0.48 | 12.53; 0.98 |

  1. Token counts and unique token counts are expressed in millions. The statistics were collected on raw texts (before preprocessing) and raw corpora (e.g. the numbers of titles and abstracts in MEDLINE may differ). For each corpus and count type we give two values: the first for the documents' textual content (abstracts or full articles) and the second for the titles. The statistics are included to highlight the compositional differences between the corpora.
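
As an illustration of how counts of this kind could be gathered, the following is a minimal sketch that computes token and unique token counts separately for document content and titles. It is not the authors' procedure: the function names `corpus_stats` and `in_millions`, the `(title, content)` input format, and the whitespace tokenization without case folding are all assumptions made for the example, since the source only states that the counts were taken on raw, unpreprocessed text.

```python
from collections import Counter


def corpus_stats(docs):
    """Count tokens and unique tokens for content and titles.

    `docs` is an iterable of (title, content) pairs of raw strings.
    Tokenization is plain whitespace splitting with no case folding,
    which is an assumption and not the tokenizer used in the paper.
    """
    counts = {"content": Counter(), "titles": Counter()}
    for title, content in docs:
        counts["titles"].update(title.split())
        counts["content"].update(content.split())
    return {
        part: {
            "tokens": sum(counter.values()),  # total token occurrences
            "unique_tokens": len(counter),    # distinct token strings
        }
        for part, counter in counts.items()
    }


def in_millions(count):
    """Express a raw count in millions, rounded to two decimals."""
    return round(count / 1_000_000, 2)


if __name__ == "__main__":
    # Toy documents; the second has an empty title to mimic a corpus
    # with an unequal number of titles and content texts.
    sample = [
        ("Heart disease", "heart disease is a disease"),
        ("", "the heart"),
    ]
    for part, values in corpus_stats(sample).items():
        print(part, values)
```

On a real corpus the raw counts would be passed through `in_millions` to obtain values comparable to those in the table. A different tokenizer would change the absolute numbers, so only the relative differences between corpora should be read from such a sketch.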