Table 3 Rewritten or suppressed terms and concepts found in the corpus.

From: Rewriting and suppressing UMLS terms for improved biomedical term identification

Rule                      Terms in corpus (all)  Terms in corpus (distinct)  Concepts in corpus (distinct)
Original                  3,992,662,340          651,268                     397,414

Rewrite rules
  Syntactic inversion     529,058                12,433                      11,291
  Possessives             34,211                 1,134                       946
  Short/long form         305,541                216                         182
  Angular brackets        30,124                 743                         731
  Semantic type           218,838                259                         259
  Begin parentheses       523                    26                          25
  End parentheses         8,916,764              4,776                       4,494
  Begin brackets          176,791                274                         251
  End brackets            65,873                 241                         236

Suppression rules
  Dosages                 109,246                5,014                       4,885
  Short token             1,906,901,846          1,009                       945
  At-sign                 0                      0                           0
  EC numbers              45,138                 149                         146
  Any classification      6,972                  42                          36
  Any underspecification  9,470                  322                         290
  Miscellaneous           91,576,083             1,257                       1,095
  Words > 5               179,051                5,734                       4,665

  1. "Terms in corpus (all)" is the number of corpus occurrences of the new terms generated by the rewrite rules and of the terms suppressed by the suppression rules. "Terms in corpus (distinct)" and "Concepts in corpus (distinct)" are the numbers of unique terms and unique concepts, respectively, that were produced or suppressed by the rules and found in the corpus. The row "Original" gives the total number of terms found in the corpus when no rewrite or suppression rule was applied.
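
The difference between the three count columns can be illustrated with a minimal sketch. It assumes that each match in the corpus is available as a (term, concept identifier) pair. The function name `corpus_counts`, the toy terms, and the identifiers `C1` and `C2` are hypothetical and do not come from the article, which does not describe how the counts were computed.

```python
def corpus_counts(matches):
    """Return (all occurrences, distinct terms, distinct concepts).

    `matches` is an iterable of (term, concept_id) pairs, one pair per
    occurrence of a term in the corpus (an assumed input format).
    """
    matches = list(matches)
    all_terms = len(matches)                              # every occurrence counts
    distinct_terms = len({term for term, _ in matches})   # unique term strings
    distinct_concepts = len({cid for _, cid in matches})  # unique concept ids
    return all_terms, distinct_terms, distinct_concepts


# Hypothetical toy data: two terms map to the same concept "C1".
matches = [
    ("heart attack", "C1"),
    ("heart attack", "C1"),
    ("myocardial infarction", "C1"),
    ("aspirin", "C2"),
]

print(corpus_counts(matches))  # (4, 3, 2)
```

Because several terms can map to one concept, the distinct concept count cannot exceed the distinct term count under this one-concept-per-match assumption, which matches the pattern in every row of the table.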