From: Rewriting and suppressing UMLS terms for improved biomedical term identification
Rule | Terms in corpus (all) | Terms in corpus (distinct) | Concepts in corpus (distinct) |
---|---|---|---|
Original | 3,992,662,340 | 651,268 | 397,414 |
Rewrite rules | | | |
Syntactic inversion | 529,058 | 12,433 | 11,291 |
Possessives | 34,211 | 1,134 | 946 |
Short/long form | 305,541 | 216 | 182 |
Angular brackets | 30,124 | 743 | 731 |
Semantic type | 218,838 | 259 | 259 |
Begin parentheses | 523 | 26 | 25 |
End parentheses | 8,916,764 | 4,776 | 4,494 |
Begin brackets | 176,791 | 274 | 251 |
End brackets | 65,873 | 241 | 236 |
Suppression rules | | | |
Dosages | 109,246 | 5,014 | 4,885 |
Short token | 1,906,901,846 | 1,009 | 945 |
At-sign | 0 | 0 | 0 |
EC numbers | 45,138 | 149 | 146 |
Any classification | 6,972 | 42 | 36 |
Any underspecification | 9,470 | 322 | 290 |
Miscellaneous | 91,576,083 | 1,257 | 1,095 |
Words > 5 | 179,051 | 5,734 | 4,665 |