
Table 2 Type and token frequencies of terms

From: MedEval — A Swedish medical test collection with doctors and patients user groups

 

|  | Entire collection | Assessed documents | Doctors assessed | Patients assessed | Common files | Doctors relevant | Patients relevant |
|---|---|---|---|---|---|---|---|
| Number of documents | 42,250 | 7,044 | 3,272 | 4,334 | 562 | 1,233 | 1,654 |
| Tokens | 12,991,157 | 5,034,323 | 3,232,772 | 2,431,160 | 629,609 | 1,361,700 | 988,236 |
| Tokens/document | 307 | 715 | 988 | 561 | 1,120 | 1,104 | 596 |
| Average word length | 5.75 | 6.04 | 6.29 | 5.73 | 6.16 | 6.33 | 5.63 |
| Full form types | 334,559 | 181,354 | 154,901 | 92,803 | 50,961 | 87,814 | 43,825 |
| Lemma types | 267,892 | 146,631 | 126,217 | 73,121 | 40,857 | 71,974 | 34,263 |
| Lemma type token ratio | 48.5 | 34.3 | 25.6 | 33.2 | 15.4 | 18.9 | 28.8 |
| Compound tokens | 1,273,874 | 573,625 | 412,475 | 237,267 | 76,117 | 179,580 | 92,420 |
| Full form compound types | 187,904 | 99,614 | 83,846 | 47,387 | 24,083 | 45,257 | 20,157 |
| Lemma compound types | 144,159 | 78,508 | 66,907 | 37,151 | 19,685 | 36,867 | 16,006 |
| Ratio of compounds | 0.098 | 0.114 | 0.128 | 0.098 | 0.120 | 0.132 | 0.094 |

Statistics for different categories of terms in different subsets of documents in the MedEval test collection.
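
The derived rows in the table appear to follow from the raw counts. The sketch below is a minimal illustration, not the authors' procedure: it assumes that Tokens/document is tokens divided by documents, that the row labelled "Lemma type token ratio" is tokens divided by lemma types, and that "Ratio of compounds" is compound tokens divided by all tokens. The function name derived_stats and its dictionary keys are hypothetical. Under these assumptions the counts for the entire collection reproduce the reported 307, 48.5 and 0.098; other columns may differ slightly in the last digit.

```python
# Counts copied from the "Entire collection" column of Table 2.
documents = 42_250
tokens = 12_991_157
lemma_types = 267_892
compound_tokens = 1_273_874


def derived_stats(documents, tokens, lemma_types, compound_tokens):
    """Compute the three derived rows under the ASSUMED definitions.

    These definitions are inferred from the numbers in the table,
    not stated in the source.
    """
    return {
        # Average number of tokens per document.
        "tokens_per_document": tokens / documents,
        # Tokens per distinct lemma (assumed meaning of the
        # "Lemma type token ratio" row).
        "tokens_per_lemma_type": tokens / lemma_types,
        # Share of all tokens that are compounds.
        "ratio_of_compounds": compound_tokens / tokens,
    }


stats = derived_stats(documents, tokens, lemma_types, compound_tokens)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Passing the counts from any other column to derived_stats gives the corresponding values for that subset.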