FlexiTerm: a flexible term recognition method

Spasić, Irena; Greenwood, Mark; Preece, Alun; Francis, Nick; Elwyn, Glyn

doi:10.1186/2041-1480-4-27

Research
Open access
Published: 10 October 2013

Journal of Biomedical Semantics volume 4, Article number: 27 (2013)

Abstract

Background

The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. Dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine or for newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and a high degree of term variation.

Results

In this paper, we describe FlexiTerm, a method for automatic term recognition from a domain-specific corpus, and evaluate its performance against five manually annotated corpora. FlexiTerm performs term recognition in two steps: linguistic filtering is used to select term candidates followed by calculation of termhood, a frequency-based measure used as evidence to qualify a candidate as a term. In order to improve the quality of termhood calculation, which may be affected by the term variation phenomena, FlexiTerm uses a range of methods to neutralise the main sources of variation in biomedical terms. It manages syntactic variation by processing candidates using a bag-of-words approach. Orthographic and morphological variations are dealt with using stemming in combination with lexical and phonetic similarity measures. The method was evaluated on five biomedical corpora. The highest values for precision (94.56%), recall (71.31%) and F-measure (81.31%) were achieved on a corpus of clinical notes.

Conclusions

FlexiTerm is an open-source software tool for automatic term recognition. It incorporates a simple term variant normalisation method. The method proved to be more robust than the baseline against less formally structured texts, such as those found in patient blogs or medical notes. The software can be downloaded freely at http://www.cs.cf.ac.uk/flexiterm.

Background

Terms are a means of conveying scientific and technical information [1]. More precisely, terms are linguistic representations of domain-specific concepts [2]. For practical purposes, they are often defined as phrases (typically nominal [3, 4]) that frequently occur in texts restricted to a specific domain and have a special meaning in that domain. Terms are distinguished from other salient phrases by the measures of their unithood and termhood [4]. Unithood is defined as the degree of collocational stability (each term has a stable inner structure), while termhood refers to the degree of correspondence to domain-specific concepts (each term corresponds to at least one domain-specific concept). Termhood implies that terms carry a heavier information load than other phrases used in a sublanguage, and as such they can be used to: provide support for natural language understanding, correctly index domain-specific documents, identify text phrases useful for automatic summarisation of domain-specific documents, efficiently skim through documents obtained through information retrieval, identify slot fillers for information extraction tasks, etc. It is, thus, essential to build and maintain terminologies in order to enhance the performance of many natural language processing (NLP) applications.

Automatic term recognition

Bearing in mind the potentially unlimited number of different domains and the dynamic nature of some domains (many of which expand rapidly together with the corresponding terminologies [5, 6]), the need for efficient term recognition becomes apparent. Manual term recognition approaches are time-consuming, labour-intensive and prone to error due to subjective judgement. Therefore, automatic term recognition (ATR) methods are needed to efficiently annotate electronic documents with a set of terms they mention [7]. Note that here ATR refers to automatic extraction of terms from a domain-specific corpus [2] rather than matching a corpus against a dictionary of terms (e.g. [8]). Dictionary-based approaches are too static for dynamic domains such as biology or the newly emerging types of media such as blogs, where lay users may discuss topics from a specialised domain (e.g. medicine), but may not necessarily use a standardised terminology. Therefore, many biomedical terms cannot be identified in text using a dictionary look-up approach [9]. It is also important to differentiate between two related problems: ATR and keyphrase extraction. Both approaches aim to extract terms from text. The ultimate goal of ATR is to extract all terms from a corpus of documents, whereas keyphrase extraction targets only those terms that can summarise and characterise a single document. The two tasks will have similar approaches to candidate selection (e.g. noun phrases), after which the respective methods will diverge. Keyphrase extraction typically relies on supervised machine learning [10, 11], while ATR is more likely to use unsupervised methods in order to explore the terminology space.

Manual term recognition relies on conceptual knowledge: human experts use tacit knowledge to identify terms by relating them to the corresponding concepts. ATR approaches, on the other hand, resort to other types of knowledge that can provide clues about the terminological status of a given natural language utterance [12], e.g. morphological, syntactic, semantic and/or statistical knowledge about terms and/or their constituents (nested terms, words, morphemes). In general, there are two basic approaches to ATR [3]: linguistic (or symbolic) and statistical.

Linguistic approaches to ATR rely on the recognition of term formation patterns, but patterns alone are not sufficient for discriminating between terms and non-terms, i.e. there is no lexico-syntactic pattern according to which it could be inferred whether a phrase matching it is a term or not [2]. However, they provide useful clues that can be used to identify term candidates if not terms themselves. Linguistic ATR approaches usually involve pattern-matching algorithms to recognise candidate terms by checking if their internal syntactic structure conforms to a predefined set of morpho-syntactic rules [13], e.g. cyclic/JJ adenosine/NN monophosphate/NN matches the pattern (JJ | NN)⁺ NN (JJ and NN are part-of-speech tags used to denote adjectives and nouns respectively). Others simply focus on noun phrases of a certain length: 2 (word bigrams), 3 (word trigrams) and 4 (word quadgrams) [14]. However, both approaches depend strongly on the ability to reliably identify noun phrases, a task that has proven to be problematic in the biological domain mainly due to the lack of highly accurate part-of-speech (POS) taggers for biomedical text [15].

Statistical ATR methods rely on the following hypotheses regarding the usage of terms [4]: specificity (terms are likely to be confined to a single or few domains), absolute frequency (terms tend to appear frequently in their domain), and relative frequency (terms tend to appear more frequently in their domain than in general). In most of the methods, two types of frequencies are used: frequency of occurrence in isolation and frequency of co-occurrence. One of the measures that combines this information is mutual information, which can be used to measure the unithood of a candidate term, i.e. how strongly its constituents are associated with one another [16]. Similarly, Tanimoto's coefficient can be used to locate the words that appear more frequently in co-occurrence than in isolation [17]. Statistical approaches are prone to extracting not only terms, but also other types of collocations: functional, semantic, thematic and other [18]. This problem is typically remedied by employing linguistic filters in the form of morpho-syntactic patterns in order to extract candidate terms from a corpus, which are then ranked using statistical information. A popular example of such an approach is C-value [19], a method which combines linguistic knowledge and statistical analysis. First, POS tagging is performed, since the syntactic information is needed in order to apply syntactic pattern matching against a corpus. The role of these patterns is to extract only those word sequences that conform to syntactic rules that describe a typical inner structure of terms. In the statistical part of the C-value method, each term candidate is quantified by its termhood following the idea of a cost-criteria based measure originally introduced for automatic collocation extraction [20]. C-value is calculated as a combination of the term’s numerical characteristics: length as the number of tokens, absolute frequency and two types of frequencies relative to the set of candidate terms containing the nested candidate term (frequency of occurrence nested inside other candidate terms and the number of different term candidates containing the nested candidate term). Formally, if T is a set of all candidate terms, t ∈ T, |t| is the number of words in t, f: T → N is the frequency function, P(T) is the power set of T, S: T → P(T) is a function that maps a candidate term to the set of all other candidate terms containing it as a substring, then the termhood, denoted as C-value(t), is calculated as follows:

\[
\text{C-value}(t) =
\begin{cases}
\ln |t| \cdot f(t), & \text{if } S(t) = \emptyset \\
\ln |t| \cdot \left( f(t) - \frac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \right), & \text{if } S(t) \neq \emptyset
\end{cases}
\tag{1}
\]

The method favours longer, more frequently and independently occurring term candidates. Better results have been reported when the limited paradigmatic modifiability was used as a measure of termhood, which is based on the probability with which specific slots in a term candidate can be filled by other tokens, i.e. the tendency not to let other tokens occur in particular slots [14].
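As a rough illustration, formula (1) can be written as a short Python function. This is a sketch under stated assumptions rather than the published implementation: the function name c_value, the dictionary-based inputs and the example frequencies are all hypothetical. Note that with a natural logarithm a single-word candidate always scores zero, because ln 1 = 0.

```python
import math


def c_value(t, freq, containing):
    """Termhood of candidate t following formula (1).

    t          -- candidate term as a tuple of tokens
    freq       -- dict mapping each candidate to its frequency f(t)
    containing -- dict mapping each candidate to the set S(t) of other
                  candidates that contain it (empty or missing if none)
    """
    length_weight = math.log(len(t))  # ln |t|
    nesting = containing.get(t, set())
    if not nesting:
        # S(t) is empty: t never occurs nested in a longer candidate.
        return length_weight * freq[t]
    # Subtract the mean frequency of the candidates that contain t.
    mean_nested = sum(freq[s] for s in nesting) / len(nesting)
    return length_weight * (freq[t] - mean_nested)


# Hypothetical frequencies, for illustration only.
short = ("hiv", "infection")
long_a = ("asymptomatic", "hiv", "infection")
long_b = ("symptomatic", "hiv", "infection")
freq = {short: 10, long_a: 4, long_b: 2}
containing = {short: {long_a, long_b}}

for term in (short, long_a, long_b):
    print(" ".join(term), round(c_value(term, freq, containing), 3))
```

In this toy example the nested candidate is penalised by the mean frequency of the two longer candidates (3), so its score is 7 · ln 2, while the longer candidates keep their full frequency weighted by ln 3.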

Term variation

Both methods perform well at identifying terms that are used consistently in the corpus, i.e. where their occurrences do not vary in structure and content. However, terms typically vary in several ways:

morphological variation, where the transformation of the content words involves inflection (e.g. lateral meniscus vs. lateral menisci) or derivation (e.g. meniscal tear vs. meniscus tear),
syntactic variation, where the content words are preserved in their original form (e.g. stone in kidney vs. kidney stone),
semantic variation, where the transformation of the content words involves a semantic relation (e.g. dietary supplement vs. nutritional supplement).

It is estimated that approximately one third of an English scientific corpus accounts for term variants, the majority of which (approximately 59%) are semantic variants, while morphological and syntactic variants account for around 17% and 24% respectively [1]. The large number of term variants emphasises the necessity for ATR to address the problem of term variation. In particular, statistically based ATR methods should include term normalisation (the process of associating term variants with one another) in order to aggregate occurrence frequencies at the semantic level rather than dispersing them across separate variants at the linguistic level [21].

Lexical programs distributed with the UMLS knowledge sources [22] incorporate an effective method for neutralising term variation [23]. Orthographic, morphological and syntactic term variants are normalised simply by tokenising each term, lowercasing each token, converting each word to its base form (lemmatisation), ignoring punctuation, ignoring tokens shorter than three characters, removing stop words (i.e. common English words such as of, and, with etc.) and sorting the remaining tokens alphabetically. For example, the genitive (possessive) forms are neutralised by this approach: Alzheimer’s disease is first tokenised to (Alzheimer, ’, s, disease), then lowercased to (alzheimer, ’, s, disease), after which punctuation and short tokens are removed, and the remaining tokens are finally sorted to obtain the normalised term representative (alzheimer, disease). The normalisation of the variant Alzheimer disease results in the same normalised form, so the two variants are matched through their normalised forms. Similarly, the genitive usage of the preposition of can be neutralised. For example, aneurysm of splenic artery and splenic artery aneurysm share the same normalised form. Note that such an approach may lead to overgeneralisation, e.g. Venetian blind and blind Venetian vary only in order, but have unrelated meanings. However, few such examples have been reported in practice [23]. Derivational and inflectional variation of individual tokens is addressed by rules which define mappings between suffixes across different lexical categories. For example, the rule –a|NN|–al|JJ maps between nouns ending with –a and adjectives ending with –al that match on the remaining parts (e.g. bacteria and bacterial), while the rule –us|NN|–i|NN matches inflected noun forms that end with –us and –i (e.g. fungus and fungi).
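The normalisation steps listed above can be sketched in a few lines of Python. This is not the UMLS lexical programs themselves: the function name umls_style_normalise, the small stop-word set and the one-entry LEMMAS table (a stand-in for a real lemmatiser) are assumptions made purely for illustration.

```python
import re

STOP_WORDS = {"of", "and", "with", "the", "in"}  # illustrative subset only
LEMMAS = {"menisci": "meniscus"}  # hypothetical stand-in for a lemmatiser


def umls_style_normalise(term):
    """Return a sorted tuple of normalised tokens for a term."""
    # Tokenise into runs of word characters and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", term)
    kept = []
    for token in tokens:
        token = token.lower()             # lowercase
        token = LEMMAS.get(token, token)  # base form (toy lookup)
        if not token.isalnum():           # ignore punctuation
            continue
        if len(token) < 3:                # ignore short tokens
            continue
        if token in STOP_WORDS:           # remove stop words
            continue
        kept.append(token)
    return tuple(sorted(kept))            # sort alphabetically


print(umls_style_normalise("Alzheimer's disease"))
print(umls_style_normalise("aneurysm of splenic artery"))
print(umls_style_normalise("Venetian blind"))
```

The last call illustrates the overgeneralisation risk mentioned above: Venetian blind and blind Venetian receive the same normalised form.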

Methods

Method overview

FlexiTerm is an open-source, stand-alone application developed to address the task of automatically identifying terms in textual documents. Similarly to C-value [24], our approach performs term recognition in two stages. First, lexico-syntactic information is used to select term candidates, after which term candidates are scored using a formula that estimates their collocation stability, while taking into account possible syntactic, morphological, derivational and orthographic variation. What differentiates FlexiTerm from C-value is the flexibility with which term candidates are compared to one another. Namely, C-value relies on exact token matching to measure the overlap between term candidates in order to identify the longest collocationally stable phrases, also taking into account the exact order in which these tokens occur. The order condition has been relaxed in later versions of C-value in order to address the term variation problem using transformation rules to explicitly map between different types of syntactic variants (e.g. stone in kidney is mapped to kidney stone using the rule NN₁ PREP NN₂ → NN₂ NN₁) [25]. FlexiTerm uses flexible comparison of term candidates by treating them as bags of words, thus completely ignoring the order of tokens, following a more pragmatic approach to neutralising term variation, which has been successfully used in practice [23] (see the Background section for details). Still, the C-value approach relies on exact token matching, which may be too rigid for types of documents that are prone to typographical errors and spelling mistakes, e.g. medical notes [26] and patient blogs [27]. Therefore, FlexiTerm adds further flexibility to term candidate comparison by allowing approximate token matching based on lexical and phonetic similarity, which often indicates not only semantically equivalent words (e.g. hemoglobin vs. haemoglobin), but also semantically related ones (e.g. hypoglycemia vs. hyperglycemia).

Edit distance (ED) has been widely applied in NLP for approximate string matching, where the distance between identical strings is equal to zero and it increases as the strings get more dissimilar with respect to the characters they contain and the order in which they appear. ED is defined as the minimal number (or cost) of changes needed to transform one string into the other. These changes may include the following edit operations: insertion of a single character, deletion of a single character, replacement (substitution) of one character by another, and transposition (reversal or swap) of two adjacent characters in one of the strings [28]. This approach has been successfully utilised in NLP applications to deal with alternate spellings, misspellings, the use of white spaces as a means of formatting, the use of upper- and lower-case letters and other orthographic variations. For example, 80% of the spelling mistakes can be identified and corrected automatically by considering a single omission, insertion, substitution or reversal [28]. ED can be practically computed using a dynamic programming approach [29]. FlexiTerm applies ED to improve token matching, thus allowing different morphological, derivational and orthographic variants together with statistical information attached to them to be aggregated.
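The dynamic programming computation of ED with the four operations named above can be sketched as follows. This is the textbook "optimal string alignment" variant with unit costs, chosen here for illustration; it is an assumption that this matches the exact variant and costs used in any particular tool.

```python
def edit_distance(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of adjacent characters, all at unit cost."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("hemoglobin", "haemoglobin"))  # one insertion
print(edit_distance("form", "from"))               # one transposition
```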

Linguistic pre-processing

Our approach to ATR takes advantage of lexico-syntactic information to identify term candidates. Therefore, the input documents need to undergo linguistic pre-processing in order to annotate them with relevant lexico-syntactic information. This process includes sentence splitting, tokenisation and POS tagging. Practically, text is first processed using the Stanford log-linear POS tagger [30, 31], which splits text into sentences and tokens, which are then annotated with POS information, i.e. lexical categories such as noun, verb, adjective, etc. The output of linguistic pre-processing is a document in which sentences and lexical categories of individual tokens (e.g. nouns, verbs, etc.) are marked up. We used the Penn Treebank tag set [32] throughout this article (e.g. NN, JJ, NP, etc.).

Term candidate extraction and normalisation

Once input documents have been pre-processed, term candidates are extracted by matching patterns that specify the syntactic structure of targeted noun phrases (NPs). These patterns are the parameters of the method and may be modified if needed. In our experiments, we used the following three patterns:

1. (JJ | NN)⁺ NN, e.g. chronic obstructive pulmonary disease
2. (NN | JJ)* NN POS (NN | JJ)* NN, e.g. Hoffa's fat pad
3. (NN | JJ)* NN IN (NN | JJ)* NN, e.g. acute exacerbation of chronic bronchitis

Further, lexical information is used to improve boundary detection of term candidates by trimming leading and trailing stop words, which include common English words (e.g. any), but also frequent modifiers of biomedical terms (e.g. small in small Baker's cyst).
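One way to apply the three patterns to POS-tagged text is to encode each tag as a single character and run an ordinary regular expression over the resulting string. The sketch below assumes the input has already been tagged; the names TAG_CODES, CANDIDATE and extract_candidates, the tiny stop-word set and the example sentence are hypothetical, and only the exact tags NN, JJ, POS and IN are recognised (a simplification).

```python
import re

# Single-character codes for the tags used in the patterns;
# every other tag is encoded as "O" (other).
TAG_CODES = {"JJ": "J", "NN": "N", "POS": "P", "IN": "I"}

# Patterns 2 and 3 come first so that the longest structure is tried
# before falling back to pattern 1.
CANDIDATE = re.compile(r"[JN]*N[PI][JN]*N|[JN]+N")

STOP_WORDS = {"any", "small"}  # illustrative subset only


def extract_candidates(tagged):
    """tagged is a list of (token, tag) pairs for one sentence."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    candidates = []
    for match in CANDIDATE.finditer(codes):
        tokens = [tok for tok, _ in tagged[match.start():match.end()]]
        # Trim leading and trailing stop words.
        while tokens and tokens[0].lower() in STOP_WORDS:
            tokens.pop(0)
        while tokens and tokens[-1].lower() in STOP_WORDS:
            tokens.pop()
        if len(tokens) > 1:
            candidates.append(" ".join(tokens))
    return candidates


sentence = [
    ("small", "JJ"), ("Baker", "NN"), ("'s", "POS"), ("cyst", "NN"),
    ("and", "CC"),
    ("chronic", "JJ"), ("obstructive", "JJ"), ("pulmonary", "JJ"),
    ("disease", "NN"),
]
print(extract_candidates(sentence))
```

Because each tag maps to exactly one character, match offsets in the encoded string are also token offsets, which keeps the mapping back to the original tokens trivial.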

In order to neutralise morphological and syntactic variation, all term candidates are normalised. The normalisation process is similar to the one described in [23] and consists of the following steps: (1) Remove punctuation (e.g. ' in possessives), numbers and stop words including prepositions (e.g. of). (2) Remove any lowercase tokens with ≤2 characters. (3) Stem each remaining token. For example, this process would map term candidates such as hypoxia at rest and resting hypoxia to the same normalised form {hypoxia, rest}, thus neutralising both the morphological and the syntactic variation that produced two linguistic representations of the same medical concept. The normalised candidate is used to aggregate the relevant information associated with the original candidates, e.g. their frequency of occurrence. This means that subsequent calculation of termhood is performed against normalised term candidates.

It should be noted that step 2 removes only lowercase tokens. This effectively removes the possessive s in Baker's cyst, but not D in vitamin D, as uppercase tokens generally convey more important information, which is therefore preserved. Also note that raising the length threshold above 2 characters would be too aggressive: it would delete not only possessives and some prepositions (e.g. of), but also essential term constituents. For example, both tokens of fat pad would be lost, so it would be ignored completely as a potential term.
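The three normalisation steps can be sketched as below. The stemmer here is a deliberately crude suffix stripper, named toy_stem, that stands in for a real stemming algorithm; it, the stop-word set and the function name flexiterm_normalise are assumptions for illustration and are not taken from the FlexiTerm source code.

```python
import re

STOP_WORDS = {"of", "at", "in", "the", "any", "small"}  # illustrative subset


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    token = token.lower()
    for suffix in ("ation", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def flexiterm_normalise(candidate):
    """Map a term candidate to a bag (frozenset) of stems."""
    # Tokenise into runs of letters, runs of digits and punctuation marks.
    tokens = re.findall(r"[^\W\d_]+|\d+|[^\w\s]", candidate)
    kept = []
    for token in tokens:
        # Step 1: remove punctuation, numbers and stop words.
        if not token.isalpha() or token.lower() in STOP_WORDS:
            continue
        # Step 2: remove lowercase tokens with at most two characters.
        if token.islower() and len(token) <= 2:
            continue
        # Step 3: stem each remaining token.
        kept.append(toy_stem(token))
    return frozenset(kept)


print(sorted(flexiterm_normalise("hypoxia at rest")))
print(sorted(flexiterm_normalise("resting hypoxia")))
print(sorted(flexiterm_normalise("Baker's cyst")))
print(sorted(flexiterm_normalise("vitamin D")))
```

Returning a frozenset makes the order of tokens irrelevant by construction, and allows normalised candidates to be used directly as dictionary keys when aggregating frequencies.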

Token-level similarity

While many types of morphological variation are effectively neutralised with stemming used as part of the normalisation process (e.g. transplant and transplantation will be reduced to the same stem), exact token matching will still fail to match synonyms that differ due to orthographic variation (e.g. haemorrhage and hemorrhage are stemmed to haemorrhag and hemorrhag respectively). On the other hand, such variations can be easily identified using approximate string matching. For example, the ED between the two stems is only 1 – a single insertion of the character a: h[a]emorrhag. In general, token similarity can be used to boost the termhood of related terms by aggregating statistical information attached to them. For example, when terms such as asymptomatic HIV infection and symptomatic HIV infection are considered separately, the frequency of nested term HIV infection, which also occurs independently, will be much greater than that of either of the longer terms. This introduces a strong bias towards shorter terms (often a hypernym of the longer terms), which may cause longer terms not to be identified as such, thus overgeneralising the semantic content. However, the lexical similarity between the constituent tokens asymptomatic and symptomatic (one deletion operation) combined with the other two identical tokens indicates high similarity between the candidate terms, which can be used to aggregate the associated information and reduce the bias towards shorter terms.

The normalisation process continues by expanding previously normalised term candidates with similar tokens found in the corpus. In the previous example, the two normalised candidates {asymptomat, hiv, infect} and {symptomat, hiv, infect} would both be expanded to the same normalised form {asymptomat, symptomat, hiv, infect}. In our implementation, similar tokens are identified based on their phonetic and lexical similarity calculated with Jazzy [33] (a spell checker API). Jazzy is based on ED [28] described earlier in more detail, but it also includes two more edit operations to swap adjacent characters and to change the case of a letter. Apart from string similarity, Jazzy supports phonetic matching with the Metaphone algorithm [34], which aims to match words that sound similar without necessarily being lexically similar. This capability is important in dealing with new phenomena such as SMS language, in which the original words are often replaced by phonetically similar ones to achieve brevity (e.g. l8 and late). This phenomenon is becoming increasingly present in online media (e.g. patient blogs) and needs to be taken into account in modern NLP applications.
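The expansion step can be approximated as follows. The sketch replaces Jazzy with a plain Levenshtein distance and a hypothetical threshold of one edit, and it ignores phonetic matching altogether, so it illustrates the idea rather than reproducing the tool's behaviour. The names levenshtein, expand and max_distance are assumptions.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def expand(candidates, max_distance=1):
    """Add to each normalised candidate every stem in the corpus that is
    within max_distance edits of one of its own stems."""
    stems = set().union(*candidates)
    similar = {
        s: {t for t in stems if levenshtein(s, t) <= max_distance}
        for s in stems
    }
    return [
        frozenset().union(*(similar[s] for s in candidate))
        for candidate in candidates
    ]


a = frozenset({"asymptomat", "hiv", "infect"})
b = frozenset({"symptomat", "hiv", "infect"})
expanded = expand([a, b])
print(sorted(expanded[0]))
print(expanded[0] == expanded[1])
```

A fixed threshold of one edit is crude for very short stems, where a single character change often yields an unrelated word; a length-dependent threshold would be one possible refinement.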

Termhood calculation

The termhood calculation is based on the C-value formula given in (1) [19]. A major difference in relation to the original C-value method is the way in which term candidates are normalised. In the C-value approach the notion of nestedness, as part of determining the set S(t), is based on substrings nested in a term candidate t treated as a string. In our approach, a term candidate is treated as a bag of words, which allows nestedness to be determined using subsets instead of substrings. This effectively bypasses the problem of syntactic variation, where individual tokens do not need to appear in the same order (e.g. kidney stone vs. stone in kidney). Other causes of term variability (mainly morphological and orthographic variation) are addressed by automatically adding similar tokens to normalised term candidates, which means that nestedness can be detected between lexically similar phrases using the subset operation. For example, exact matching would fail to detect posterolateral corner as nested in postero-lateral corner sprain because of hyphenation (a special case of orthographic variation). In our approach, these two term candidates would be represented as {postero-later, posterolater, corner} and {postero-later, posterolater, corner, sprain} respectively, where similar stems postero-later and posterolater have been automatically detected in the corpus and used to expand normalised term candidates. In this case, nestedness is detected by simply checking the following condition: {postero-later, posterolater, corner} ⊆ {postero-later, posterolater, corner, sprain}.
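With candidates represented as sets, the set S(t) reduces to a proper-subset test. The following minimal sketch uses Python frozensets; the function name containing_sets is hypothetical.

```python
def containing_sets(candidates):
    """Map each normalised candidate t to S(t), the set of other
    candidates that contain t as a proper subset."""
    return {
        t: {s for s in candidates if t < s}  # "<" is proper subset
        for t in candidates
    }


corner = frozenset({"postero-later", "posterolater", "corner"})
sprain = frozenset({"postero-later", "posterolater", "corner", "sprain"})

nesting = containing_sets([corner, sprain])
print(sprain in nesting[corner])  # corner is nested in sprain
print(len(nesting[sprain]))       # nothing contains sprain
```

The pairwise comparison is quadratic in the number of candidates, which is acceptable for a sketch but may need indexing by stem for larger corpora.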

The FlexiTerm method is summarised with the following pseudocode:

1. Pre-process text to annotate it with lexico-syntactic information.
2. Select term candidates using pattern matching on POS tagged text.
3. Normalise term candidates by performing the following steps.
  a. Remove punctuation, numbers and stop words.
  b. Remove any lowercase tokens with ≤2 characters.
  c. Stem each remaining token.
4. Extract distinct token stems from normalised term candidates.
5. Compare token stems using lexical and phonetic similarity calculated with Jazzy API.
6. Expand normalised term candidates by adding similar token stems determined in step 5.
7. For each normalised term candidate t:
  a. Determine set S(t) of all normalised term candidates that contain t as a subset.
  b. Calculate C-value(t) according to formula (1).
8. Rank normalised term candidates using their C-value.

Output

Once terms are recognised, FlexiTerm produces output that can be used by either a human user or other NLP applications. Three types of output are produced: (1) a ranked list of terms with their termhood scores presented as a table in the HTML format, (2) a plain list of terms that can be utilised as a lexicon by other NLP applications, and (3) a list of regular expressions in Mixup (My Information eXtraction and Understanding Package), a simple pattern-matching language [35]. Figure 1 shows a portion of the HTML output in which term variants with the same normalised form are grouped together and assigned a single termhood score. Lowercased term variants are given as they occurred in the corpus and are ordered by their frequency of occurrence. In effect, the plain text output presents the middle column of the HTML output. The term list can be utilised in a dictionary matching approach (e.g. [36]) to annotate all term occurrences in a corpus. Rather than annotating occurrences in text, we opted for this approach as it is more flexible and avoids conflict with annotations produced by other applications. Still, for a quick overview of terms and the context in which they appeared, the Mixup output can be used by MinorThird, a collection of Java classes for annotating text [35], to visualise the results (see Figure 2) and save the stand-off annotations, which include document name, start position of a term occurrence and its length.
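To make the notion of stand-off annotation concrete, the sketch below matches a plain term list against a document and records the document name, start position and length of each occurrence. It is a naive case-insensitive dictionary matcher written for illustration; it does not reproduce the Mixup or MinorThird formats, and the function name annotate and the sample document are hypothetical.

```python
import re


def annotate(doc_name, text, terms):
    """Return sorted (document name, start position, length) triples
    for every case-insensitive occurrence of a listed term."""
    annotations = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            length = match.end() - match.start()
            annotations.append((doc_name, match.start(), length))
    return sorted(annotations)


text = "Kidney stone was found. The kidney stone is small."
for annotation in annotate("note1.txt", text, ["kidney stone"]):
    print(annotation)
```

Because the annotations refer to character offsets rather than modifying the text, they can coexist with annotations produced by other tools over the same document.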

Results

Data

FlexiTerm is a domain-independent ATR method, that is, it does not rely on any domain-specific knowledge (e.g. rules or dictionaries) to recognise terms in a domain-specific corpus. A comprehensive study of subdomain variation in biomedical language has highlighted significant implications for NLP applications, in particular standard training and evaluation procedures for biomedical NLP tools [37]. This study revealed that the commonly used molecular biology subdomain is not representative of the overall biomedical domain, meaning that the results obtained using a corpus from this subdomain (e.g. [38]) cannot be generalised in terms of expecting comparable performance with other types of biomedical text. In particular, a comparative evaluation of ATR algorithms indicated that choice, design, quality and size of corpora have a significant impact on their performance [39]. Therefore, in order to demonstrate the portability of our method across sublanguages, i.e. languages confined to specialised domains [40], we used multiple data sets from different biomedical subdomains (e.g. molecular biology, medical diagnostic imaging or respiratory diseases) as well as text written by different types of authors and/or aimed at different audiences (e.g. scientists, healthcare professionals or patients). We used five data sets (see Tables 1 and 2 for a basic description).

Table 1 Data sets used in evaluation
