
Table 1. The Shared Task data sets. The top three rows list the properties of the training data, detailing its two components (biomedical abstracts and full articles) separately and then in total. The bottom row summarizes the official held-out test data (articles only). Token counts are based on the tokenizer described above.

From: Predicting speculation: a simple disambiguation approach to hedge detection in biomedical literature

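| Data Set      | Sentences | Hedged Sentences | Cues  | Multi-Word Cues | Tokens  | Cue Tokens |
|---------------|-----------|------------------|-------|-----------------|---------|------------|
| Abstracts     | 11,871    | 2,101            | 2,659 | 364             | 309,634 | 3,056      |
| Articles      | 2,670     | 519              | 668   | 84              | 68,579  | 782        |
| Total (train) | 14,541    | 2,620            | 3,327 | 448             | 378,213 | 3,838      |
| Held-Out      | 5,003     | 790              | 1,033 | 87              | 138,276 | 1,148      |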
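
As a quick sanity check on the figures in Table 1, the following minimal sketch confirms that the abstracts and articles rows sum to the training total, and derives the share of hedged sentences per data set. The counts are copied from the table; the names `TABLE_1`, `column_sums`, and `hedged_ratio` are hypothetical helpers introduced only for this illustration and are not part of the original work.

```python
# Counts copied from Table 1, in column order:
# (sentences, hedged sentences, cues, multi-word cues, tokens, cue tokens)
TABLE_1 = {
    "Abstracts": (11871, 2101, 2659, 364, 309634, 3056),
    "Articles": (2670, 519, 668, 84, 68579, 782),
    "Total (train)": (14541, 2620, 3327, 448, 378213, 3838),
    "Held-Out": (5003, 790, 1033, 87, 138276, 1148),
}


def column_sums(*rows):
    """Add the given rows together column by column."""
    return tuple(sum(column) for column in zip(*rows))


def hedged_ratio(row):
    """Fraction of sentences in a row that contain at least one hedge."""
    sentences, hedged = row[0], row[1]
    return hedged / sentences


# The two training components should add up to the reported total row.
train_sum = column_sums(TABLE_1["Abstracts"], TABLE_1["Articles"])
totals_consistent = train_sum == TABLE_1["Total (train)"]
print("Training totals consistent:", totals_consistent)

# Derived statistic (not reported in the table itself).
for name, row in TABLE_1.items():
    print(f"{name}: {hedged_ratio(row):.1%} of sentences hedged")
```

The hedged-sentence share is a derived quantity, so it should be read as an illustration of how the columns relate rather than as a figure stated in the source.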