
Table 1. The Shared Task data sets. The top three rows list the properties of the training data, separately detailing its two components: biomedical abstracts and full articles. The bottom row summarizes the official held-out test data (articles only). Token counts are based on the tokenizer described above.

From: Predicting speculation: a simple disambiguation approach to hedge detection in biomedical literature

| Data Set | Sentences | Hedged Sentences | Cues | Multi-Word Cues | Tokens | Cue Tokens |
|---|---|---|---|---|---|---|
| Abstracts | 11,871 | 2,101 | 2,659 | 364 | 309,634 | 3,056 |
| Articles | 2,670 | 519 | 668 | 84 | 68,579 | 782 |
| Total (train) | 14,541 | 2,620 | 3,327 | 448 | 378,213 | 3,838 |
| Held-Out | 5,003 | 790 | 1,033 | 87 | 138,276 | 1,148 |
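
The counts in Table 1 can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the original study: it copies the counts from the table, confirms that the "Total (train)" row equals the sum of the Abstracts and Articles rows, and derives the fraction of hedged sentences per data set. The names `TABLE_1`, `train_total_is_consistent`, and `hedged_fraction` are hypothetical helpers introduced here.

```python
# Counts copied from Table 1, in column order:
# (sentences, hedged sentences, cues, multi-word cues, tokens, cue tokens)
TABLE_1 = {
    "Abstracts": (11871, 2101, 2659, 364, 309634, 3056),
    "Articles": (2670, 519, 668, 84, 68579, 782),
    "Total (train)": (14541, 2620, 3327, 448, 378213, 3838),
    "Held-Out": (5003, 790, 1033, 87, 138276, 1148),
}


def train_total_is_consistent(table):
    """Return True if Abstracts + Articles equals Total (train) in every column."""
    summed = tuple(a + b for a, b in zip(table["Abstracts"], table["Articles"]))
    return summed == table["Total (train)"]


def hedged_fraction(row):
    """Fraction of sentences in a row that contain at least one hedge cue."""
    sentences, hedged = row[0], row[1]
    return hedged / sentences


if __name__ == "__main__":
    print("Training total consistent:", train_total_is_consistent(TABLE_1))
    for name, row in TABLE_1.items():
        print(f"{name}: {hedged_fraction(row):.1%} of sentences are hedged")
```

Derived ratios such as these are only as reliable as the transcribed counts, so the consistency check is worth running before the fractions are quoted elsewhere.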