Table 3 Rewritten or suppressed terms and concepts found in the corpus.

From: Rewriting and suppressing UMLS terms for improved biomedical term identification

| Rule | Terms in corpus (all) | Terms in corpus (distinct) | Concepts in corpus (distinct) |
| --- | --- | --- | --- |
| Original | 3,992,662,340 | 651,268 | 397,414 |
| Rewrite rules | | | |
| Syntactic inversion | 529,058 | 12,433 | 11,291 |
| Possessives | 34,211 | 1,134 | 946 |
| Short/long form | 305,541 | 216 | 182 |
| Angular brackets | 30,124 | 743 | 731 |
| Semantic type | 218,838 | 259 | 259 |
| Begin parentheses | 523 | 26 | 25 |
| End parentheses | 8,916,764 | 4,776 | 4,494 |
| Begin brackets | 176,791 | 274 | 251 |
| End brackets | 65,873 | 241 | 236 |
| Suppression rules | | | |
| Dosages | 109,246 | 5,014 | 4,885 |
| Short token | 1,906,901,846 | 1,009 | 945 |
| At-sign | 0 | 0 | 0 |
| EC numbers | 45,138 | 149 | 146 |
| Any classification | 6,972 | 42 | 36 |
| Any underspecification | 9,470 | 322 | 290 |
| Miscellaneous | 91,576,083 | 1,257 | 1,095 |
| Words > 5 | 179,051 | 5,734 | 4,665 |
  1. "Terms in corpus (all)" gives the number of corpus occurrences of the new terms generated by the rewrite rules, or of the terms suppressed by the suppression rules. "Terms in corpus (distinct)" and "Concepts in corpus (distinct)" give the number of unique terms and unique concepts, respectively, that were produced or suppressed by each rule and found in the corpus. The "Original" row gives the totals for terms found in the corpus when no rewrite or suppression rule was applied.
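
To make the three column definitions concrete, here is a minimal Python sketch of how such counts could be derived from a list of matches. It is an illustration only, not the authors' pipeline: the `matches` list, the `summarize` function, and the concept identifiers are hypothetical, and each entry is assumed to represent one occurrence of a term in the corpus together with the concept it maps to.

```python
from collections import Counter

# Hypothetical input: one (term, concept_id) pair per occurrence in the corpus.
# The terms and concept identifiers are invented for illustration.
matches = [
    ("cancer of lung", "C0001"),
    ("cancer of lung", "C0001"),
    ("lung cancer", "C0001"),
    ("kidney failure", "C0002"),
]


def summarize(matches):
    """Return the three counts reported per rule in the table."""
    # Count how often each term string occurs.
    term_counts = Counter(term for term, _ in matches)
    return {
        # "Terms in corpus (all)": every occurrence counts.
        "terms_all": sum(term_counts.values()),
        # "Terms in corpus (distinct)": each unique term string counts once.
        "terms_distinct": len(term_counts),
        # "Concepts in corpus (distinct)": each unique concept counts once.
        "concepts_distinct": len({concept_id for _, concept_id in matches}),
    }


summary = summarize(matches)
print(summary)
```

In this toy input, two distinct terms map to the same concept, which is why the distinct-concept count can be lower than the distinct-term count, as it is in most rows of the table.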