The rewrite rules were implemented to increase the recall of UMLS concepts in text. The suppression rules, on the other hand, were implemented to rid the UMLS of terms that are undesirable for term identification. Such terms either reduce the precision of term identification (e.g. the synonym "2" for the term "clinical class", or the synonym "EC 2.7.1.-" for the concept "human CDC7 protein") or reduce its efficiency. The latter group comprises long and vague terms that are unlikely to be found in text, such as "poisoning by other and unspecified drugs and medicinal substances", and terms that are useless for concept identification, such as the concept with the single term "WHILE". We applied the rules to the 2007AA version of the UMLS in UTF-8 encoding and then indexed citations from the MEDLINE database (1965-2007), which we refer to as "the corpus" in the rest of the paper. Finally, the identified rewritten terms were manually assessed for their correspondence to the original UMLS terms, and the identified suppressed terms were manually assessed for their usefulness for automatic text mining purposes. A detailed description of the procedure follows.
UMLS extraction
The UMLS 2007AA version was downloaded from the UMLS knowledge source server [28] and installed locally using the MetamorphoSys tool provided by NLM for customizing the UMLS. The default settings in MetamorphoSys were used to create the UMLS subset, using the option to include all vocabularies in the English language. Strings marked as suppressible by the NLM as well as strings longer than 255 characters were not included in the analysis. This approach resulted in 2,844,004 strings, based on the String Unique Identifier (SUI) field in the UMLS. These strings belonged to 1,294,936 concepts, based on the CUI field in the UMLS. Duplicate strings within a concept were removed by comparing strings after conversion to lower case and removal of punctuation; 2,696,820 strings remained and these are henceforth referred to as "terms".
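The duplicate removal described above can be illustrated with a minimal Python sketch. It assumes that "removal of punctuation" means deleting ASCII punctuation characters and collapsing whitespace; the function names `normalize` and `dedupe_within_concept` are hypothetical and are not part of the UMLS or MetamorphoSys.

```python
import string

# Assumption: "punctuation" is the ASCII punctuation set.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(term: str) -> str:
    """Lower-case a string, delete punctuation, and collapse whitespace."""
    return " ".join(term.lower().translate(_PUNCT_TABLE).split())


def dedupe_within_concept(strings):
    """Keep the first string for every distinct normalized form."""
    seen = set()
    kept = []
    for s in strings:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```

Applied to the strings of one concept, the function keeps one representative per normalized form, so "Renal failure" and "renal failure" would count as a single term.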
Corpus creation
All MEDLINE citations (title and abstract) available at the time of this study, with publication dates ranging from January 1965 to December 2007 (17,674,805 citations, of which 9,446,335 have an abstract) were used as a test corpus.
Creation of rules
A set of nine rewrite rules and eight suppression rules was implemented. A description of the rules, together with their motivation and any differences from the original source (when applicable), is provided below. In order to avoid introducing duplicates and homonyms when applying the rewrite rules, a new term was not added to a concept if it could already be found among the synonyms for that concept or any other concept (case-insensitive matching after removal of punctuation).
1) Rewrite rules
Syntactic inversion[24, 26]: add syntactic inversion of term if a term contains a comma followed by a space and does not contain a preposition or conjunction (e.g. "Failure, Renal"). We added the condition that only one such pattern of a comma followed by a space is to be found in a term for the rule to be executed.
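A minimal Python sketch of the syntactic inversion rule, under stated assumptions: the set `FUNCTION_WORDS` is a small illustrative list of prepositions and conjunctions (the list actually used is not given here), and `syntactic_inversion` is a hypothetical function name.

```python
# Assumption: a small illustrative list; the real list may differ.
FUNCTION_WORDS = {"and", "or", "but", "nor", "of", "in", "with",
                  "for", "to", "by", "on", "at", "from"}


def syntactic_inversion(term):
    """Return the inverted term, or None if the rule does not apply."""
    # The rule fires only when exactly one ", " pattern is present.
    if term.count(", ") != 1:
        return None
    words = term.replace(",", " ").lower().split()
    if any(w in FUNCTION_WORDS for w in words):
        return None
    head, modifier = (part.strip() for part in term.split(", "))
    if not head or not modifier:
        return None
    return f"{modifier} {head}"
```

For example, `syntactic_inversion("Failure, Renal")` yields "Renal Failure", whereas a term containing "and" or more than one comma-space pattern is left alone.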
Possessives[26]: remove the possessive "'s" at the end of a word (e.g. "Alzheimer's disease") and add the rewritten term.
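The possessive rule amounts to a single regular-expression substitution. The sketch below is one possible reading; `strip_possessive` is a hypothetical name, and only the straight apostrophe is handled.

```python
import re

# "'s" directly after a word character and at the end of that word.
_POSSESSIVE = re.compile(r"(?<=\w)'s\b")


def strip_possessive(term):
    """Return the term without possessive "'s", or None if unchanged."""
    rewritten = _POSSESSIVE.sub("", term)
    return rewritten if rewritten != term else None
```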
Short form/long form[29]: add short form and long form of term (e.g. "Selective Serotonin Reuptake Inhibitors (SSRIs)"). Schwartz and Hearst's algorithm [29] achieved 96% precision and 82% recall on a standard test collection, which was as good as existing approaches at the time [29] and still competitive according to recent comparison studies [30, 31]. An advantage of the algorithm is that, unlike other approaches, it does not require any training data. Two extra conditions were added to the original rule by Schwartz and Hearst: 1) the short form must be found at the end of the term, and 2) the first letter of the short form should be the same as the first letter of the long form. These conditions were added in order to adjust the rule to extract abbreviations from a dictionary instead of from biomedical text.
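The two extra conditions can be illustrated with a short Python sketch. This is deliberately not Schwartz and Hearst's algorithm: the character-by-character alignment between short and long form is omitted, and only the dictionary-specific checks (short form in final parentheses, matching first letters) are shown. The name `split_short_long` is hypothetical.

```python
import re

# Long form followed by one parenthesized short form at the very end.
_TRAILING_PAREN = re.compile(r"^(?P<long>.+?)\s*\((?P<short>[^()]+)\)$")


def split_short_long(term):
    """Return (long_form, short_form) or None if the conditions fail."""
    match = _TRAILING_PAREN.match(term.strip())
    if match is None:
        return None
    long_form = match.group("long").strip()
    short_form = match.group("short").strip()
    if not long_form or not short_form:
        return None
    # Condition 2: first letters must agree (compared case-insensitively,
    # which is an assumption of this sketch).
    if short_form[0].lower() != long_form[0].lower():
        return None
    return long_form, short_form
```

In a full implementation the candidate pair would still have to pass the original alignment step before both forms are added as terms.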
Angular brackets[26]: remove expressions within angular brackets anywhere in a term. This pattern was previously used in the UMLS to denote polysemy or homonymy of a term, i.e. a term having different meanings. Terms having this property still exist in the UMLS, even though the property is not assigned to new terms. We have adjusted the rule to remove expressions within angular brackets anywhere in a term since these expressions usually contain meta-information about a term, which is unlikely to be found in text (e.g. "Chondria <beetle>").
Semantic type: remove expressions within parentheses that match the list of semantic types in the UMLS (e.g. "Surgical intervention (finding)"). This rule was developed by our group based on the observation that the semantic type to which the term belongs is often added as meta-information about the term.
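The angular-bracket and semantic-type rules could be sketched in Python as follows. The set `SEMANTIC_TYPES` is a two-entry illustrative sample rather than the full UMLS list, and both function names are hypothetical.

```python
import re

# Assumption: a tiny sample; a real run would load all UMLS semantic types.
SEMANTIC_TYPES = {"finding", "disease or syndrome"}

_ANGULAR = re.compile(r"<[^<>]*>")
_PAREN = re.compile(r"\s*\(([^()]*)\)")


def strip_angular(term):
    """Remove every <...> expression and tidy the whitespace."""
    return " ".join(_ANGULAR.sub(" ", term).split())


def strip_semantic_type(term, types=SEMANTIC_TYPES):
    """Remove parenthesized expressions that name a semantic type."""
    def replace(match):
        inner = match.group(1).strip().lower()
        return " " if inner in types else match.group(0)
    return " ".join(_PAREN.sub(replace, term).split())
```

Parenthesized expressions that are not in the semantic type list, such as "(human)", are left in place by `strip_semantic_type`.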
Non-essential parentheticals[24, 26] has been split into four rules in order to make the error analysis more transparent:
1. Begin parentheses: remove expressions within parentheses at the beginning of a term (e.g. "(protein) methionine-R-sulfoxide reductase")
2. Begin brackets: remove expressions within brackets at the beginning of a term (e.g. "[V] Alcohol use")
3. End parentheses: remove expressions within parentheses at the end of a term (e.g. "flagellar filament (sensu Bacteria)")
4. End brackets: remove expressions within brackets at the end of a term (e.g. "Gluten-free foods [generic 1]")
In addition, we have added the condition that the rule does not apply to terms belonging to the semantic group Chemicals & Drugs. The reason for this condition is that chemical expressions by nature often contain both brackets and parentheses at the beginning or end of a term.
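The four parenthetical rules and the Chemicals & Drugs exception can be expressed as one regular expression per rule. The sketch below is a hedged illustration: the rule keys and the function name `strip_parenthetical` are hypothetical, and the semantic group is assumed to be passed in as a plain string.

```python
import re

# One pattern per rule; nested parentheses or brackets are not handled.
PARENTHETICAL_RULES = {
    "begin_parentheses": re.compile(r"^\([^()]*\)\s*"),
    "begin_brackets": re.compile(r"^\[[^\[\]]*\]\s*"),
    "end_parentheses": re.compile(r"\s*\([^()]*\)$"),
    "end_brackets": re.compile(r"\s*\[[^\[\]]*\]$"),
}


def strip_parenthetical(term, rule, semantic_group=None):
    """Apply one rule; return the rewritten term or None if not applicable."""
    if semantic_group == "Chemicals & Drugs":
        return None  # chemical names legitimately start or end with brackets
    rewritten = PARENTHETICAL_RULES[rule].sub("", term).strip()
    if not rewritten or rewritten == term:
        return None
    return rewritten
```

Keeping the four patterns separate mirrors the split made for the error analysis: each rewritten term can be traced back to exactly one rule.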
2) Suppression rules
Short token[24, 26]: remove a term if, after tokenization and removal of stop words, every remaining token is a single character or an Arabic or Roman numeral. For this rule, the stop word list from PubMed [32] was used. This rule differs from the one in [24, 26] in that it takes each token into account separately (e.g. the term "10*9/L" would be tokenised to "10 9 L" and removed by this rule since every token is either a number or a single character).
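A minimal Python sketch of the short token rule follows. It assumes tokenization on non-alphanumeric characters, uses a handful of stop words as a stand-in for the PubMed list, and treats a term that consists only of stop words as not suppressed; all three choices, and the name `is_short_token_term`, are assumptions of this example.

```python
import re

# Assumption: illustrative subset, not the PubMed stop word list.
STOP_WORDS = {"a", "an", "and", "of", "the", "in"}

# Roman numerals up to the thousands; the lookahead rejects empty input.
_ROMAN = re.compile(
    r"^(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)


def is_short_token_term(term):
    """True if every non-stop-word token is a single character or a number."""
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", term) if t]
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]

    def trivial(token):
        return len(token) == 1 or token.isdigit() or bool(_ROMAN.match(token))

    return bool(tokens) and all(trivial(t) for t in tokens)
```

One caveat of matching Roman numerals case-insensitively is that ordinary words such as "mix" are also valid numerals, so a production rule would likely need an exception list.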
Dosages[24]: the original rule addressed terms belonging to certain term types defined by the NLM in the UMLS, namely BD (Fully-specified drug brand name that can be prescribed), CD (Clinical Drug) or MS (Multiple names of branded and generic supplies or supplements). This rule was further refined by us to remove all terms that contain a dosage in percent, gram, microgram or milliliter (e.g. Oxygen 2%).
At-sign: this rule was implemented by us to remove terms that contain the @-character (e.g. ADHESIVE @@ BANDAGE).
EC numbers[26]: remove terms that contain enzyme classification numbers as defined by IUPAC (e.g. EC 2.7.1.112). The justification for this rule is that an EC number in the UMLS is usually mapped to a specific enzyme, while it actually refers to a class of enzymes.
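The dosage, at-sign and EC number rules are all simple pattern checks. The Python sketch below shows one way to express them; the unit abbreviations (`g`, `mcg`, `ug`, `ml`) and the exact EC number pattern are assumptions, since the text above does not spell them out, and `suppression_reasons` is a hypothetical name.

```python
import re

# A number followed by "%" or by an abbreviated gram/microgram/milliliter unit.
_DOSAGE = re.compile(r"\d+(?:\.\d+)?\s*(?:%|(?:g|mcg|ug|ml)\b)", re.IGNORECASE)

# "EC" followed by up to four dot-separated levels; "-" marks an open level.
_EC_NUMBER = re.compile(r"\bEC\s*\d+(?:\.(?:\d+|-)){1,3}")


def suppression_reasons(term):
    """Return the names of the pattern rules that would suppress the term."""
    reasons = []
    if _DOSAGE.search(term):
        reasons.append("dosage")
    if "@" in term:
        reasons.append("at_sign")
    if _EC_NUMBER.search(term):
        reasons.append("ec_number")
    return reasons
```

Milligram dosages are not matched, in line with the four units listed for the dosage rule.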
Any classification[24]: remove terms containing the following properties: "NEC" at the end of a term and preceded by a comma, "NEC" within parentheses or brackets at the end of a term and preceded by a space, "not elsewhere classified", "unclassified", "without mention" (e.g. "Unclassified sequences").
Any underspecification[24, 26]: remove terms containing the following properties: "not otherwise specified", "not specified", or "unspecified"; "NOS" at the end of a term and preceded by a comma, or "NOS" within parentheses or brackets at the end of a term and preceded by a space (e.g. "Other and unspecified leukaemia").
Miscellaneous[24, 26]: remove terms containing the following properties: "other" at the beginning of a term and followed by a space character or at the end of a term and preceded by a space character; "deprecated", "unknown", "obsolete", "miscellaneous", or "no" at the beginning of a term and followed by a space character (e.g. "Other").
Words > 5[25]: remove terms that contain more than five words (e.g. "Head and Neck Squamous Cell Carcinoma"). This rule is not applied to terms belonging to the semantic group Chemicals & Drugs.
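The last four suppression rules (Any classification, Any underspecification, Miscellaneous and Words > 5) can be sketched together in Python. The sketch assumes that the abbreviations "NEC" and "NOS" are matched case-sensitively and the phrases case-insensitively, and that words are separated by whitespace; none of these details is stated above, and the rule keys and function name are hypothetical.

```python
import re

SUPPRESSION_PATTERNS = {
    "any_classification": re.compile(
        r",\s*NEC$|\s(?:\(NEC\)|\[NEC\])$"
        r"|(?i:not elsewhere classified|unclassified|without mention)"
    ),
    "any_underspecification": re.compile(
        r",\s*NOS$|\s(?:\(NOS\)|\[NOS\])$"
        r"|(?i:not otherwise specified|not specified|unspecified)"
    ),
    "miscellaneous": re.compile(
        r"(?i:^other\s|\sother$"
        r"|^(?:deprecated|unknown|obsolete|miscellaneous|no)\s)"
    ),
}


def suppression_rules_hit(term, semantic_group=None, max_words=5):
    """Return the names of all rules that would suppress the term."""
    hits = [name for name, pattern in SUPPRESSION_PATTERNS.items()
            if pattern.search(term)]
    # The word-count rule is skipped for the Chemicals & Drugs group.
    if semantic_group != "Chemicals & Drugs" and len(term.split()) > max_words:
        hits.append("words_gt_5")
    return hits
```

A term can trigger several rules at once: "Other and unspecified leukaemia" matches both the underspecification and the miscellaneous pattern, which matters when each rule is evaluated separately.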
Term and concept recognition
For the term and concept recognition we used our concept recognition software Peregrine [33]. For this study, Peregrine was set up to mimic a minimal, general-purpose concept recognizer performing case-insensitive string lookup (ignoring punctuation), similar to, for instance, TextPresso [34]. Largest match was turned off, meaning that nested terms were counted as a match for both the longer and the shorter term. We chose this set-up because we wanted to see the effect of the rewrite and suppression rules as clearly as possible.
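To make the matching behaviour concrete, the following Python sketch shows a case-insensitive dictionary lookup that ignores punctuation and reports nested matches. It is not Peregrine's implementation; the tokenization, the function names and the concept identifiers used in the usage note are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lower-case the text and split on anything that is not a letter or digit."""
    return re.findall(r"[0-9a-z]+", text.lower())


def build_index(terms_by_concept):
    """Map each tokenized term to the set of concepts it belongs to."""
    index = defaultdict(set)
    for concept_id, terms in terms_by_concept.items():
        for term in terms:
            key = tuple(tokens(term))
            if key:
                index[key].add(concept_id)
    return index


def find_matches(text, index):
    """Report every dictionary term in the text, including nested ones."""
    toks = tokens(text)
    max_len = max((len(key) for key in index), default=0)
    hits = []
    for start in range(len(toks)):
        for length in range(1, max_len + 1):
            key = tuple(toks[start:start + length])
            if len(key) < length:
                break  # ran past the end of the text
            for concept_id in sorted(index.get(key, ())):
                hits.append((" ".join(key), concept_id))
    return hits
```

With a dictionary containing "Renal failure" for one concept and "Failure" for another, the text "Acute renal failure, chronic." yields both matches, because no largest-match filtering is applied.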
Evaluation
Each rule was evaluated separately. To assess the effect of a rule, the difference in the set of terms identified in the corpus before and after applying the rule was determined. For rewrite rules, the number of different additional terms found was determined. In addition, for each term its frequency of occurrence in the corpus was computed. For the suppression rules, the number of different suppressed terms was determined and for each term the number of times it was suppressed in the corpus. A manual analysis of the top 50 most frequent terms and 100 randomly selected terms was performed for each rule. This analysis was used to determine the size of the effect and to judge its quality.
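The per-rule comparison and the selection of terms for manual review can be sketched in Python as below. The sketch assumes that term frequencies are available as dictionaries, and that the 100 random terms are drawn from the terms outside the top 50; the latter is an interpretation, not something stated above. Function names are hypothetical.

```python
import random


def rule_effect(counts_before, counts_after):
    """Return (added, removed) terms with their corpus frequencies."""
    added = {t: c for t, c in counts_after.items() if t not in counts_before}
    removed = {t: c for t, c in counts_before.items() if t not in counts_after}
    return added, removed


def review_sample(term_counts, top_n=50, random_n=100, seed=0):
    """Pick the most frequent terms plus a random sample of the remainder."""
    ranked = sorted(term_counts, key=lambda t: (-term_counts[t], t))
    top = ranked[:top_n]
    rest = ranked[top_n:]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = rng.sample(rest, min(random_n, len(rest)))
    return top, sampled
```

For a rewrite rule the `added` dictionary holds the additional terms found in the corpus; for a suppression rule the `removed` dictionary holds the suppressed terms and how often each occurred.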