Unsupervised grammar induction of clinical report sublanguage
© Kate; licensee BioMed Central Ltd. 2012
Published: 5 October 2012
Clinical reports are written using a subset of natural language while employing many domain-specific terms; such a language is known as a sublanguage for a scientific or technical domain. Different genres of clinical reports use different sublanguages, and in addition, different medical facilities use different medical language conventions. This makes supervised training of a parser for clinical sentences very difficult, as it would require expensive annotation effort to adapt to every type of clinical text.
In this paper, we present an unsupervised method which automatically induces a grammar and a parser for the sublanguage of a given genre of clinical reports from a corpus with no annotations. In order to capture sentence structures specific to clinical domains, the grammar is induced in terms of semantic classes of clinical terms in addition to part-of-speech tags. Our method induces the grammar by minimizing the combined encoding cost of the grammar and the corresponding sentence derivations. The probabilities for the productions of the induced grammar are then learned from the unannotated corpus using an instance of the expectation-maximization algorithm.
Our experiments show that the induced grammar is able to parse novel sentences. Using a dataset of discharge summary sentences with no annotations, our method obtains 60.5% F-measure for parse-bracketing on sentences of maximum length 10. By varying a parameter, the method can induce a range of grammars, from very specific to very general, and obtains the best performance in between the two extremes.
Obtaining a syntactic parse is an important step in analyzing a sentence. Syntactic parsers are typically built using supervised learning methods: several hundred or thousand sentences are first manually annotated with syntactic parses, and a learning method then learns from this annotated data how to parse novel sentences. A major drawback of this approach is that it requires a lot of manual effort from trained linguists to annotate sentences. Also, a parser trained in one domain does not do well on another domain without being adapted with extra annotations from the new domain. This drawback becomes even more severe when the domain is that of clinical reports or medical text, because the annotators need not only to be trained linguists but also to have sufficient clinical knowledge to understand the clinical terms and sentence forms. This is a rare combination of expertise, which makes the annotation process for clinical reports even more expensive. Also, different genres of clinical reports, like discharge summaries, radiology notes, cardiology reports, etc., differ from each other and hence will require separate annotations. On top of that, different hospitals or medical centers may be using their own conventions of clinical terms and sentence styles in writing clinical reports, which may require separate annotation effort to adapt a syntactic parser to work for clinical reports across institutions.
Besides the annotation effort required, another drawback of supervised syntactic parsing is that it forces a particular "gold standard" of syntactic parses which may not be best suited for the end-application in which the syntactic parses will be used. For example, in the application of semantic parsing, the task of converting a sentence into an executable meaning representation, it was found that the conventional gold standard syntactic parse trees were not always isomorphic with their semantic trees , which lowered the performance of semantic parsing. In the domain of clinical reports, where sentences are often succinct and may not follow typical English grammar, it is not easy to decide the gold standard parses in advance. For example, a sentence like "Vitamin B12 250 mcg daily" could be parsed with brackets such as "((Vitamin B12 250 mcg) daily)" or "((Vitamin B12) (250 mcg daily))", depending upon whether the end-application associates the "250 mcg" quantity with "Vitamin B12" or with "daily". During the annotation process, however, a particular form will get forced as part of the annotation convention, without regard to what may be better suited for the end-application down the road. Syntactic parses are not an end in themselves but an intermediate form that is supposed to help an end-application; hence it is best if such an intermediate form is not set in advance but gets decided based on the end-application. An alternative to supervised learning for building parsers is unsupervised learning. In this framework, a large set of unannotated sentences, which is often easily obtainable, is given to an unsupervised learning method. Using some criterion or bias, for example simplicity of the grammar and the corresponding sentence derivations, the method tries to induce a grammar that best fits all the sentences. Novel sentences are then parsed using this learned grammar.
While unsupervised parsing methods are not as accurate as supervised methods, the fact that they require no manual supervision makes them an attractive alternative, especially for the domain of clinical reports, for the reasons pointed out earlier.
An additional advantage of unsupervised parsing is that the grammar induction process itself may be adapted so as to do best on the end-application. For example, instead of using a simplicity bias to guide the grammar induction process, a criterion of maximizing accuracy on the end-application may be used. This way the induced grammar may choose one parse over another for the "Vitamin B12 250 mcg daily" example, depending upon which is more helpful for the end-application. In , an analogous approach was used to transform a semantic grammar to best suit the semantic parsing application.
In this paper, we present an approach for unsupervised grammar induction for clinical reports, which to our knowledge is the first such attempt. We adapt and extend the simplicity bias (or cost reduction) method  of unsupervised grammar induction. We chose this method because its iterative grammar-modifying process, based on grammar transformation operators, is amenable to adaptation to criteria other than simplicity bias. This could be useful for adapting the grammar induction process to maximally benefit some end-application. Another advantage of this method is that it directly gives the grammar in terms of non-terminals it creates on its own; some other existing methods only give bracketing [4, 5] or force the user to specify the number of non-terminals . The induced grammar is also not restricted to binary form, unlike in some previous methods [6, 7]. After inducing the grammar, in order to do statistical parsing, the probabilities for its productions are obtained using an instance of the expectation-maximization (EM) algorithm  run over the unannotated training sentences.
In the experiments, we first show that the learned grammar is able to parse novel sentences. Measuring the accuracy of parses obtained through an unsupervised parsing method is always challenging, because such parses may be good in some way even though they do not match the correct parses. The ideal way to measure the performance of unsupervised parsing is to measure how well it helps in an end-application. At present, however, in order to measure the parsing accuracy, we annotated one hundred sentences with parsing brackets and measured how well they match the brackets obtained when parsing with the induced grammar.
Cost reduction method for grammar induction
For inducing a context-free grammar from training sentences, we adapted the cost reduction method , which is based on Wolff's idea of language and data compression , also known as the simplicity bias or minimum description length method. The method starts with a large trivial grammar that has a separate production corresponding to each training sentence. It then heuristically searches for a smaller grammar, as well as simpler sentence derivations, by repeatedly applying grammar transformation operators that combine and merge non-terminals. The size of the grammar and derivations is measured in terms of their encoding cost. We have extended this method in a few ways; we describe the method and our extensions in this section. We first describe how the cost is computed, then the search procedure that finds the grammar leading to the minimum cost, and finally how the probabilities associated with the productions of the induced grammar are computed.
Computing the cost
The method uses ideas from information theory and views the grammar as a means to compress the description of the given set of unannotated training sentences. It measures the compression in terms of two types of costs. The first is the cost (in bits) of encoding the grammar itself. The second is the cost of encoding the sentence derivations using that grammar. In the following description we make use of some of the notations from .
Cost of grammar
C_G = Σ_{i=1..p} (1 + |β_i|) log s

where p is the number of productions, β_i is the RHS of the i-th production, and s is the number of distinct symbols in the grammar, so that each of the 1 + |β_i| symbols written down for a production (its LHS plus its RHS symbols) takes log s bits to encode.
Cost of derivations

A derivation is encoded by specifying, for each non-terminal NT that gets expanded, which of its productions was used. If P(NT) denotes the set of productions with NT on the LHS, this choice takes log |P(NT)| bits. The cost of all the derivations is then

C_D = Σ_{derivations} Σ_{NT expanded} log |P(NT)|     (4)

The total cost is a weighted combination of the two components,

C = f C_G + (1 − f) C_D     (5)

where C_G is the cost of the grammar and C_D is the cost of all derivations as described before. Note that f = 0.5 is equivalent to adding the two components as in the previous work. In the experiments, we vary this parameter and empirically measure the performance.
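The two cost components described above and their weighted combination can be sketched as follows. This is a minimal Python sketch, not the paper's implementation; it assumes each symbol of a production is encoded with log2(s) bits, with s the size of the symbol inventory, and it represents a grammar as a list of (LHS, RHS-tuple) productions and a derivation as the list of production applications it makes.

```python
from math import log2

def grammar_cost(grammar):
    """Bits to encode the grammar: every production writes down its LHS and
    its RHS symbols, each symbol taking log2(s) bits, where s is the size
    of the symbol inventory (an assumption of this sketch)."""
    symbols = set()
    for lhs, rhs in grammar:
        symbols.add(lhs)
        symbols.update(rhs)
    bits = log2(len(symbols)) if len(symbols) > 1 else 1.0
    return sum((1 + len(rhs)) * bits for _, rhs in grammar)

def derivation_cost(derivations, grammar):
    """Bits to encode the derivations: each expansion of a non-terminal NT
    costs log2 |P(NT)|, the number of productions with NT on the LHS (so a
    non-terminal with a single expansion is free to encode)."""
    p = {}
    for lhs, _ in grammar:
        p[lhs] = p.get(lhs, 0) + 1
    return sum(log2(p[lhs]) for derivation in derivations
               for lhs, _ in derivation)

def total_cost(grammar, derivations, f=0.5):
    """Weighted combination of the two components; f = 0.5 is equivalent
    (up to scale) to plain addition."""
    return f * grammar_cost(grammar) + (1 - f) * derivation_cost(derivations, grammar)
```

With f = 1 only the grammar cost matters and with f = 0 only the derivation cost matters, which is how the parameter sweeps between the two extreme grammars discussed next.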
Grammar search for minimum cost
It is important to point out that there is a trade-off between the cost of the grammar and the cost of the derivations. At one extreme is the simplest grammar, which has productions NT → t_i (a single non-terminal NT that expands to every terminal t_i) along with two more productions, S → NT and S → S S (S being the start symbol). This grammar has very little cost; however, it leads to very long and expensive derivations. It is also worth pointing out that this grammar is overly general and will parse any sequence of terminals.
At the other extreme is a grammar in which each production encodes an entire sentence from the training set, for example S → w_1 w_2 .. w_n, where w_1, w_2, etc. are the words of a sentence. The derivations of this grammar have very little cost; however, the grammar itself is very expensive, as it has long productions and as many of them as there are sentences. This grammar is also overly specific and will not parse any sentence besides the ones in the training set. Hence the best grammar lies between the two extremes: general enough to parse novel sentences, but not so general that it parses almost any sequence of terminals. Such a grammar will also have a smaller cost than either extreme. According to the minimum description length principle, as well as Occam's razor, a grammar with minimum cost is likely to have the best generalization. We use the following search procedure to find the grammar that gives the minimum total cost, where the total cost is as defined in equation 5. We note that by varying the value of the parameter f in that definition, the minimum-cost search procedure can find the different extremes of grammars. For example, with f = 1 it will find the first type of extreme grammar with the least grammar cost, and with f = 0 it will find the second type with the least derivation cost.
The search procedure begins with a trivial grammar similar to the second extreme type mentioned before. A separate production is included for each unique sentence in the training data: if the sentence is w_1 w_2 .. w_n, a production S → W_1 W_2 .. W_n is included along with productions W_1 → w_1, W_2 → w_2, etc., where W_1, W_2, etc. are new non-terminals corresponding to the respective terminals. The new non-terminals are introduced because the grammar transformation operators described below do not work directly with terminals. Instances of the two grammar transformation operators are then applied in sequence in a greedy manner, each application reducing the total cost. We first describe the two operators, combine and merge, and then the greedy procedure that applies them. While the merge operator is the same as in , we have generalized the combine operator (which they called the create operator). The search procedure is analogous to theirs, except that we first efficiently estimate the reductions in cost obtained by different instances of the operators and then apply the one that gives the largest reduction; they instead generate new grammars for all instances of the operators and then calculate the reductions in cost. They also follow separate loops for applying series of merge and combine operators, whereas we follow a single loop for both.
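The construction of this initial grammar can be sketched as follows. This is a hypothetical Python sketch: it reuses one wrapper non-terminal W_t per distinct terminal t, which is one reasonable reading of the description above.

```python
def initial_grammar(sentences):
    """Build the trivial starting grammar: one S production per unique
    sentence, with a wrapper non-terminal W_t for each distinct terminal t
    so that the transformation operators never touch terminals directly."""
    productions = []              # (LHS, RHS-tuple) pairs
    seen = set()                  # unique sentences already covered
    wrapper = {}                  # terminal -> its wrapper non-terminal
    for sentence in sentences:
        words = tuple(sentence.split())
        if not words or words in seen:
            continue
        seen.add(words)
        rhs = []
        for w in words:
            if w not in wrapper:
                wrapper[w] = "W_" + w
                productions.append((wrapper[w], (w,)))
            rhs.append(wrapper[w])
        productions.append(("S", tuple(rhs)))
    return productions
```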
Combine operator

This operator combines two or more non-terminals to form a new non-terminal. For example, if the non-terminals "DT ADJ NN" appear very often in the current grammar, then the cost (equivalently, size) of the grammar can be reduced by introducing a new production C1 → DT ADJ NN, where C1 is a system-generated non-terminal. All the occurrences of DT ADJ NN on the RHSs of productions are then replaced by C1. As can be seen, this reduces the size of all those productions, but at the same time adds a new production and a new non-terminal. In , the corresponding operator only combined two non-terminals at a time and could combine more than two only through multiple applications (for example, first combining DT and ADJ into C1 and then combining C1 and NN into C2). We found this to be less cost-effective in the search procedure than directly combining multiple non-terminals, hence we generalized the operator.
It may be noted that this operator only changes the cost of the grammar and not the cost of the derivations. This is so because in the derivations, the only change will be the application of the extra production (like C1 → DT ADJ NN), and since there is only one way to expand the new non-terminal C1, there is no need to encode it (i.e. |P(C1)| is 1, hence its log is zero in equation 4). It is also interesting to note that this operator does not increase the coverage of the grammar, i.e., the new grammar obtained after applying the combine operator will not be able to parse any new sentence that it could not parse before. The coverage does not decrease either.
The reduction in cost from applying any instance of this operator can be estimated easily in terms of the number of non-terminals being combined and how many times they occur adjacently on the RHSs of current productions in the grammar. Note that if the non-terminals do not appear adjacently often enough, this operator can in fact increase the cost.
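The combine operator and the estimate of its cost reduction can be sketched as follows. This is hypothetical Python: the saving is computed in symbol counts as a simple stand-in for the bit costs, with each of the k non-overlapping occurrences of an n-gram shrinking by n − 1 symbols at the price of one new production of n + 1 symbols.

```python
from itertools import count

_fresh = count(1)  # numbering for system-generated non-terminals

def apply_combine(grammar, ngram):
    """Introduce a new production C_k -> ngram and replace every
    non-overlapping occurrence of the n-gram on existing RHSs with C_k."""
    new_nt = "C%d" % next(_fresh)
    n = len(ngram)
    new_grammar = []
    for lhs, rhs in grammar:
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + n]) == ngram:
                out.append(new_nt)
                i += n
            else:
                out.append(rhs[i])
                i += 1
        new_grammar.append((lhs, tuple(out)))
    new_grammar.append((new_nt, ngram))
    return new_grammar

def combine_saving(grammar, ngram):
    """Estimated reduction in grammar size, in symbols: each of the k
    occurrences shrinks by n - 1 symbols, at the price of one new
    production of n + 1 symbols. Negative when k is too small."""
    n, k = len(ngram), 0
    for _, rhs in grammar:
        i = 0
        while i <= len(rhs) - n:
            if tuple(rhs[i:i + n]) == ngram:
                k += 1
                i += n
            else:
                i += 1
    return k * (n - 1) - (n + 1)
```

As the text notes, the saving can be negative: a bigram occurring only twice, for example, gives 2·1 − 3 = −1, so combining it would enlarge the grammar.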
Merge operator

This operator merges two non-terminals into one. For example, it may replace all instances of the NNP and NNS non-terminals in the grammar by a new non-terminal M1. This operator is the same as in ; we did not generalize it to merge more than two non-terminals because, unlike for the combine operator, it is combinatorially expensive to find the right combination of non-terminals to merge (for the combine operator, we describe this procedure in the next subsection).
The merge operator can also eliminate some productions. For example, if there were two productions NP → DT NNP and NP → DT NNS, then upon merging NNP and NNS into M1, both reduce to the same production NP → DT M1. This not only reduces the cost of the grammar by reducing its size, but also reduces |P(NP)| (the number of productions with NP on the LHS), which results in a further decrease in the derivation cost (equation 4). However, if there were productions with NNP and NNS on the LHS, then merging them makes |P(M1)| equal to the sum of |P(NNP)| and |P(NNS)|, and replacing NNP and NNS by M1 everywhere in the derivations increases the cost of the derivations.
To estimate the reduction in cost from applying an instance of this operator, one needs to estimate which productions will get merged (and hence eliminated) and how many other productions share the LHS non-terminals of these productions. In our implementation, we do this efficiently by maintaining data structures relating non-terminals to the productions they appear in, and relating productions to the derivations they appear in; we omit those details here due to lack of space. As mentioned before, while the cost may decrease for some reasons, it can also increase for others. Hence an application of an instance of this operator can also increase the overall cost.
It is important to mention that application of this operator can only increase the coverage of the grammar. For example, given productions NNS → apple, VB → eat, and VP → VB NNP, but no production VP → VB NNS, "eat apple" cannot be parsed into VP. However, merging NNP and NNS into M1 results in productions M1 → apple and VP → VB M1, which parse "eat apple" into VP. Hence this operator generalizes the grammar.
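The merge operator can be sketched as follows (hypothetical Python): both non-terminals are renamed to a single new one everywhere, on LHS and RHS alike, and productions that become exact duplicates are dropped, as in the NP → DT NNP / NP → DT NNS example above.

```python
def apply_merge(grammar, nt_a, nt_b, merged="M1"):
    """Rename nt_a and nt_b to a single new non-terminal everywhere (on
    both LHS and RHS), dropping productions that become exact duplicates."""
    rename = lambda s: merged if s in (nt_a, nt_b) else s
    out, seen = [], set()
    for lhs, rhs in grammar:
        prod = (rename(lhs), tuple(rename(s) for s in rhs))
        if prod not in seen:
            seen.add(prod)
            out.append(prod)
    return out
```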
Our method follows a greedy search procedure to find the grammar that results in the minimum overall cost of the grammar and the derivations (equation 5). Given a set of unannotated training sentences, it starts with the trivial, overly specific extreme type of grammar in which a production is included for each unique sentence in the training set, as mentioned before. Next, all applicable instances of both the combine and merge operators are considered, and the reduction in cost from applying each is estimated. The operator instance that results in the greatest cost reduction is then applied. This process continues iteratively until no operator instance decreases the cost. The resultant grammar is then returned as the induced grammar.
In order to find all the applicable instances of the combine operator, all "n-grams" of the non-terminals on the RHSs are considered (the maximum value of n was 4 in the experiments); there is no reason to consider the exponentially many combinations of non-terminals that do not appear even once in the grammar. For the merge operator, however, there is no simple alternative to considering every pair of non-terminals in the grammar (it is not obvious that any other way would be significantly more efficient with regard to estimating the reductions in cost). The start symbol of the grammar is preserved and is never merged with any other symbol. Note that this search procedure is greedy and may only give an approximate solution, which could be a local minimum.
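The greedy loop can be sketched end-to-end as follows. This hypothetical Python sketch simplifies in two ways it is worth flagging: it scores candidates with a symbol-count proxy for the grammar cost only (ignoring the derivation cost), and it recomputes the cost of each candidate grammar directly, whereas the actual method estimates the reductions without generating candidate grammars. The start symbol is hard-coded as "S".

```python
from collections import Counter
from itertools import combinations

def cost(grammar):
    # symbol-count proxy for the encoding cost: 1 for the LHS plus one
    # per RHS symbol of every production
    return sum(1 + len(rhs) for _, rhs in grammar)

def replace_ngram(rhs, ngram, nt):
    # rewrite one RHS, replacing non-overlapping occurrences of ngram by nt
    out, i, n = [], 0, len(ngram)
    while i < len(rhs):
        if tuple(rhs[i:i + n]) == ngram:
            out.append(nt)
            i += n
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)

def combine(grammar, ngram, nt):
    new = [(lhs, replace_ngram(rhs, ngram, nt)) for lhs, rhs in grammar]
    new.append((nt, ngram))
    return new

def merge(grammar, a, b, nt):
    rename = lambda s: nt if s in (a, b) else s
    out, seen = [], set()
    for lhs, rhs in grammar:
        prod = (rename(lhs), tuple(rename(s) for s in rhs))
        if prod not in seen:
            seen.add(prod)
            out.append(prod)
    return out

def induce(grammar, max_n=4):
    """Greedily apply the single best cost-reducing combine or merge,
    repeating until no candidate reduces the cost."""
    step = 0
    while True:
        candidates = []
        # combine candidates: n-grams (n = 2..max_n) occurring more than once
        grams = Counter(tuple(rhs[i:i + n])
                        for _, rhs in grammar
                        for n in range(2, max_n + 1)
                        for i in range(len(rhs) - n + 1))
        for g, c in grams.items():
            if c > 1:
                candidates.append(combine(grammar, g, "C%d" % step))
        # merge candidates: every pair of non-terminals except the start symbol
        nts = sorted({lhs for lhs, _ in grammar} - {"S"})
        for a, b in combinations(nts, 2):
            candidates.append(merge(grammar, a, b, "M%d" % step))
        best = min(candidates, key=cost, default=None)
        if best is None or cost(best) >= cost(grammar):
            return grammar
        grammar, step = best, step + 1
```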
Obtaining production probabilities
The method described in the previous subsections induces a grammar but does not give the probabilities associated with its productions. If a sentence can be parsed in multiple ways using a grammar, then probabilities associated with its productions provide a principled way to choose one parse over another in a probabilistic context-free grammar parsing setting . In this subsection, we describe an augmentation to our method that obtains these probabilities using an instance of the expectation-maximization (EM) algorithm . As the initialization step, probabilities are assigned uniformly to all the productions that expand a non-terminal so that they sum to one. For example, if there are four productions that expand a non-terminal, say NP, then each of those four productions is assigned a probability of 0.25. Next, using these probabilities, the training sentences are parsed and the most probable parse is obtained for each of them. In the implementation, we used a probabilistic version  of the well-known Earley parsing algorithm for context-free grammars . In the following iteration, treating these parses as if they were the correct parses, the method counts how many times each production is used in the parses and how many times its LHS non-terminal is expanded; the corresponding fraction is then assigned as the probability of that production, similar to the way probabilities are computed in a supervised parsing setting from sentences annotated with correct parses. Using these new probabilities, the entire process is repeated in the next iteration. Experimentally, we found that this process converges within five iterations. Instead of choosing only the most probable parse for each sentence in every iteration, we also experimented with using all parses of a sentence and accumulating fractional counts proportional to the parses' probabilities. However, this did not make a noticeable difference.
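The initialization and re-estimation steps of this hard-EM procedure can be sketched as follows. This is hypothetical Python: the parsing step between iterations (e.g. a probabilistic Earley parser) is assumed to be available and is not shown, and a parse is represented simply as the list of productions it uses.

```python
from collections import Counter, defaultdict

def uniform_init(grammar):
    """Initialization: every production expanding the same non-terminal
    gets an equal share of probability mass."""
    by_lhs = defaultdict(int)
    for lhs, _ in grammar:
        by_lhs[lhs] += 1
    return {prod: 1.0 / by_lhs[prod[0]] for prod in grammar}

def reestimate(best_parses):
    """One M-step of hard EM: treat the current best parses as correct,
    count how often each production is used, and divide by how often its
    LHS non-terminal is expanded."""
    prod_counts = Counter(p for parse in best_parses for p in parse)
    lhs_counts = Counter(lhs for parse in best_parses for lhs, _ in parse)
    return {prod: c / lhs_counts[prod[0]] for prod, c in prod_counts.items()}
```

In the full loop, the training sentences would be re-parsed with the updated probabilities between calls to reestimate; per the observation above, about five such iterations suffice.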
To create a dataset, we took the first 5000 sentences from the discharge summaries section of the Pittsburgh corpus , using Stanford CoreNLP's sentence segmentation utility. We ran MetaMap  on these sentences to obtain part-of-speech tags and UMLS semantic types of words and phrases. MetaMap appeared to run endlessly on some long sentences, hence we restricted the data to sentences of maximum length 20 (i.e. all 5000 sentences were of maximum length 20). Since many UMLS semantic types seemed very fine-grained, we chose only 27 of them that seemed relevant for clinical reports (these included "disease or syndrome", "finding", "body part, organ, or organ component", "pathologic function", "medical device", etc.). All occurrences of these semantic types were substituted for the actual words and phrases in the sentences. Figure 1(a) shows an original sentence from the corpus and 1(b) shows the same sentence with some words and phrases substituted by their UMLS semantic types. Figure 1(c) shows the part-of-speech tags of the words of the original sentence as obtained by MetaMap. Note that the part-of-speech tags that MetaMap outputs are not as fine-grained as those in the Penn Treebank. Also note that the entire last phrase "ventricular assist device" is tagged as a single noun. Finally, the words that were not replaced by the chosen UMLS semantic types were replaced by their part-of-speech tags.
Figure 1(d) shows the original sentence from 1(a) with words and phrases replaced by part-of-speech tags and UMLS semantic types. We ran all our experiments on sentences transformed into this final form. We note that our experiments are inherently limited by the accuracy of MetaMap in determining the correct part-of-speech tags and UMLS semantic types.
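The substitution step can be sketched as follows. This is hypothetical Python: `phrase_types`, a mapping from word-index spans to semantic types, is a made-up stand-in for MetaMap's actual output format, and `CHOSEN_TYPES` lists only a few of the 27 chosen types.

```python
# A few of the 27 chosen UMLS semantic types (illustrative subset only)
CHOSEN_TYPES = {"disease or syndrome", "finding", "medical device"}

def to_sublanguage_tokens(words, pos_tags, phrase_types):
    """Replace each word or multi-word phrase with its UMLS semantic type
    when it has one of the chosen types, and with its part-of-speech tag
    otherwise. `phrase_types` maps (start, end) word-index spans to
    semantic types; it is a hypothetical stand-in for MetaMap output."""
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):   # prefer the longest phrase at i
            if phrase_types.get((i, j)) in CHOSEN_TYPES:
                out.append(phrase_types[(i, j)])
                i = j
                break
        else:
            out.append(pos_tags[i])          # no chosen-type phrase starts here
            i += 1
    return out
```

This mirrors the "ventricular assist device" example: the whole phrase becomes one token when MetaMap assigns it a chosen semantic type, and ordinary words fall back to their POS tags.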
We separated 1000 of the 5000 sentences and used them as test sentences to determine how well the induced grammar works on novel sentences; the rest were used as the training data to induce the grammar. Out of these 1000 test sentences, we manually put correct parsing brackets on 100 sentences to test the performance of the parses obtained with the induced grammar. We are not aware of any annotated corpus in the clinical report domain that we could have used to measure this performance.
Results and discussion
An obvious future task is to apply this approach to other genres of clinical reports present in the Pittsburgh corpus. We have, in fact, already done this, except that we have not yet manually created corresponding bracket-annotated corpora for measuring parsing performance. A bigger annotated corpus for evaluating the current results on the discharge summaries genre is also desirable. Another avenue of future work is to improve the search procedure for finding the optimum grammar; one way would be to do a beam search. Besides using the UMLS semantic types, in the future one may define additional semantic types that could help in some application, for example a negation class of words, or a class of words representing patients. Currently the method first induces the grammar and then estimates the probabilities of its productions from the same data. An interesting possibility for future work will be to integrate the two steps so that the probabilities are computed and employed even during the grammar induction process. This would be a more elegant method and would likely lead to an improvement in the parsing performance.
Unsupervised parsing is particularly suitable for clinical domains because it does not require expensive annotation effort to adapt to different genres and styles of clinical reporting. We presented an unsupervised approach for inducing grammar for clinical report sublanguage in terms of part-of-speech tags and UMLS semantic types. We showed that using the cost-reduction principle, the approach is capable of learning a range of grammars from very specific to very general and achieves the best parsing performance in between.
- Ge R, Mooney RJ: Learning a Compositional Semantic Parser using an Existing Syntactic Parser. Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), Suntec, Singapore. 2009, 611-619.
- Kate RJ: Transforming Meaning Representation Grammars to Improve Semantic Parsing. Proceedings of the Twelfth Conference on Computational Natural Language Learning (CoNLL 2008), Manchester, UK. 2008, 33-40.
- Langley P, Stromsten S: Learning Context-Free Grammar with a Simplicity Bias. Proceedings of the 11th European Conference on Machine Learning (ECML-00), Barcelona, Spain. 2000, 220-228.
- Seginer Y: Fast Unsupervised Incremental Parsing. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007), Prague, Czech Republic. 2007, 384-391.
- Ponvert E, Baldridge J, Erk K: Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA. 2011, 1077-1086.
- Klein D, Manning C: A Generative Constituent-Context Model for Improved Grammar Induction. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia, PA. 2002.
- Klein D, Manning CD: Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain. 2004, 479-486.
- Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B. 1977, 39: 1-38.
- Chapman WW, Saul M, Houston J, Irwin J, Mowery D, Karkeme H, Becich M: Creation of a Repository of Automatically De-Identified Clinical Reports: Processes, People, and Permission. AMIA Summit on Clinical Research Informatics, San Francisco, CA. 2011.
- Harris Z: The Form of Information in Science: Analysis of an Immunology Sublanguage. 1989, Kluwer Academic.
- Sager N, Friedman C, Lyman MS: Medical Language Processing: Computer Management of Narrative Data. 1987, Reading, MA: Addison-Wesley.
- Friedman C, Alderson PO, Austin JHM, Cimino J, Johnson SB: A general natural language text processor for clinical radiology. Journal of the American Medical Informatics Association. 1994, 1: 161-174. 10.1136/jamia.1994.95236146.
- Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004, 32: D267-D270. 10.1093/nar/gkh061.
- Aronson AR, Lang FM: An Overview of MetaMap: Historical Perspectives and Recent Advances. Journal of the American Medical Informatics Association. 2010, 17: 229-236.
- Wolff JG: Language Acquisition, Data Compression, and Generalization. Language and Communication. 1982, 2: 57-89. 10.1016/0271-5309(82)90035-0.
- Chen TH, Tseng CH, Chen CP: Automatic Learning of Context-Free Grammar. Proceedings of the 18th Conference on Computational Linguistics and Speech Processing. 2006.
- Jurafsky D, Martin JH: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2008, Upper Saddle River, NJ: Prentice Hall.
- Stolcke A: An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities. Computational Linguistics. 1995, 21 (2): 165-201.
- Earley J: An Efficient Context-Free Parsing Algorithm. Communications of the Association for Computing Machinery. 1970, 13 (2): 94-102.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.