Mining characteristics of epidemiological studies from Medline: a case study in obesity

Background The health sciences literature incorporates a relatively large subset of epidemiological studies that focus on population-level findings, including various determinants, outcomes and correlations. Extracting structured information about those characteristics would be useful for more complete understanding of diseases and for meta-analyses and systematic reviews. Results We present an information extraction approach that enables users to identify key characteristics of epidemiological studies from MEDLINE abstracts. It extracts six types of epidemiological characteristic: design of the study, population that has been studied, exposure, outcome, covariates and effect size. We have developed a generic rule-based approach that has been designed according to semantic patterns observed in text, and tested it in the domain of obesity. Identified exposure, outcome and covariate concepts are clustered into health-related groups of interest. On a manually annotated test corpus of 60 epidemiological abstracts, the system achieved precision, recall and F-score between 79-100%, 80-100% and 82-96% respectively. We report the results of applying the method to a large scale epidemiological corpus related to obesity. Conclusions The experiments suggest that the proposed approach could identify key epidemiological characteristics associated with a complex clinical problem from related abstracts. When integrated over the literature, the extracted data can be used to provide a more complete picture of epidemiological efforts, and thus support understanding via meta-analysis and systematic reviews.


Background
Epidemiological studies aim to discover the patterns and determinants of diseases, and other health related states by studying the health of populations in standardised ways. They are valuable sources of evidence for public health measures and for shaping of research questions in the clinical and biological aspects of complex diseases. Nevertheless, the increasing amount of published literature leads to information overload, making the task of reading and integrating relevant knowledge a challenging process [1][2][3]. For example, there are more than 23,000 obesity-related articles reporting on different epidemiological findings, including almost 3,000 articles with obesity/epidemiology as a MeSH descriptor in 2012, with more than 15,000 such articles in the last 10 years. Therefore, there is a need for systems that enable the extraction of salient epidemiological study features in order to assist investigators to reduce the time required to detect, summarise and incorporate epidemiological information from the relevant literature [4].
Epidemiology is a relatively structured field with its own dictionary and reporting style, deliberately written in a typical semi-structured format in order to standardize and improve study design, communication and collaboration. The standard characteristics in most epidemiological studies include [5]: study design -a specific plan or protocol that has been followed in the conduct of the study; population -demographic details of the individuals (e.g., gender, age, ethnicity, nationality) participating in an epidemiological study; exposure -a factor, event, characteristic or other definable entity that brings about change in a health condition or in other defined characteristics; outcome -the consequence from the exposure in the population of interest; covariate -a concept that is possibly predictive of the outcome under study; effect size -the measure of the strength of the relationship between variables, that relates outcomes to exposures in the population of interest.
In this paper we present a system that enables the identification and retrieval of the key characteristics from epidemiological studies. We have applied the system to the obesity epidemiological literature. Obesity is one of the most important health problems of the 21st century [6], presenting a great public health and economic challenge [7][8][9]. The rapid and worldwide spread of obesity has affected people of all ages, genders, geographies and ethnicities. It has been regarded as a multidimensional disorder [10], with major behavioural and environmental determinants, with genetics playing only a minor role [7].

Related work
In the last decade, a significant amount of research has been performed on the extraction of information in the biomedical field, especially on the identification of biological [11,12] and clinical concepts [13,14] in the literature. In clinical text mining, several attempts have been made to extract various kinds of information from case studies and clinical trials in particular [1][2][3][4][15][16][17][18][19][20][21][22][23]. For example, De Bruijn et al. [22] applied text classification with a "weak" regular expression matcher on randomized clinical trial (RCT) reports for the recognition of key trial information that included 23 characteristics (e.g. eligibility criteria, sample size, route of treatment, etc.) with overall precision of 75%. The system was further expanded to identify and extract specific characteristics such as primary outcome names and names of experimental treatment from journal articles reporting RCTs [4], with precision of 93%. However, they focused solely on RCTs and especially on randomized controlled drug treatment trials. Hara and Matsumoto [1] extracted information about the design of phase III clinical trials. They extracted patient population and compared associated treatments through noun phrase chunking and categorisation along with regular expression pattern matching. They reported precision for population and compared treatments of 80% and 82% respectively. Hansen et al. [2] worked on RCTs identifying the numbers of the trial participants through a support vector machine algorithm with 97% precision, while Fizman et al. [19] aimed to recognize metabolic syndrome risk factors in MEDLINE citations through automatic semantic interpretation with 67% precision. However, to the best of our knowledge, there is no approach available for recognising key information elements from various types of epidemiological studies that are related to a particular health problem.

Methods
Our approach involved the design and implementation of generic rule-based patterns, which identify mentions of particular characteristics of epidemiological studies in PubMed abstracts (Figure 1). The rules are based on patterns that were engineered from a sample of 60 epidemiological abstracts in the domain of obesity. Mentions of six semantic types (study design, population, exposures, outcomes, covariates and effect size) have been manually identified and reviewed. Additionally, a development set of 30 further abstracts was used to optimise the performance of the rules. These steps are explained here in more detail.
1. Abstract selection and species filtering. In the first step, abstracts are retrieved from PubMed using specific MeSH terms (e.g. obesity/epidemiology[mesh]). They are checked by LINNAEUS, a species identification system [24], to filter out studies based on non-human species.
2. Building of dictionaries of potential mentions. In the second step, a number of semantic classes are identified using custom-made vocabularies that include terms to detect key characteristics in epidemiological study abstracts (e.g. dictionaries of words that indicate study design, population totals, etc.; a total of fourteen dictionaries). We also identify mentions of Unified Medical Language System (UMLS) [25] terms and additionally apply the Specialist lexicon [26] in order to extract potential exposure, outcome, covariate and population concepts. Finally, epidemiological abstracts are processed with an automatic term recognition (ATR) method for the extraction of multi-word candidate concepts and their variants [27,28]. Filtering against a common stop-word list (created by Fox [29]) is applied to remove any concepts of non-biomedical nature.
3. Mention-level application of rules. In the third step, rules are applied to the abstracts for each of the six epidemiological characteristics separately. The rules make use of two constituent types: frozen lexical expressions (used as anchors for specific categories) and specific semantic classes identified through the vocabularies (identified in step 2), which are combined using regular expressions. The frozen lexical expressions can contain particular verbs, prepositions or certain nouns.
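To make the third step more concrete, the following is a minimal Python sketch of how a frozen lexical anchor and dictionary-derived semantic classes might be combined into one regular expression for population mentions. It is not the system's actual rule set (which runs in MinorThird): the vocabulary contents, the names `TOTALS`, `CLUSTERS`, `POPULATION_RULE` and `extract_populations`, and the limit of three modifier words are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny stand-ins for two of the custom vocabularies.
TOTALS = ["a total of", "sample of", "included"]          # frozen lexical anchors
CLUSTERS = ["men", "women", "children", "patients",
            "individuals", "subjects", "controls"]         # population nouns


def alternation(terms):
    """Build a regex alternation, longest term first so longer terms win."""
    ordered = sorted(terms, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"


# Digits with optional thousands separators, e.g. "52" or "3,715".
NUMBER = r"(?:\d{1,3}(?:,\d{3})+|\d+)"

# Optional anchor, then the extracted span: number, up to three modifier
# words (an assumed limit), and a population noun from the vocabulary.
POPULATION_RULE = re.compile(
    rf"(?:{alternation(TOTALS)}\s+)?"
    rf"(?P<population>{NUMBER}\s+(?:[\w-]+\s+){{0,3}}?{alternation(CLUSTERS)})\b",
    re.IGNORECASE,
)


def extract_populations(text):
    """Return the population spans matched by the illustrative rule."""
    return [m.group("population") for m in POPULATION_RULE.finditer(text)]


if __name__ == "__main__":
    sentence = ("The study comprised of 52 subjects with histologically "
                "confirmed advanced colorectal polyps and 53 healthy controls")
    print(extract_populations(sentence))
```

Because the population nouns come from a fixed vocabulary, a mention such as "3,715 deliveries" would be missed by this sketch unless "deliveries" were added to `CLUSTERS`.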

Evaluation
We evaluated the system's performance at the document level by considering whether selected spans were correctly marked in text. We calculated precision, recall and F-score for each of the characteristics of interest using the standard definitions [31]. In order to create an evaluation dataset, 60 abstracts were randomly selected from the PubMed results obtained by the query obesity/epidemiology[mesh] and manually double-annotated for all six epidemiological characteristics by the first author and an external curator with epidemiological expertise.
The inter-annotator agreement of 80% was calculated on the evaluation dataset by the absolute agreement rate [32], suggesting relatively reliable annotations. Table 2 shows the results on the evaluation set, with the results obtained on the training and development sets given for comparison (Tables 3 and 4). The precision and recall values ranged from 79% to 100% and 80% to 100%, with F-measures between 82% and 96%. The best precision was observed for study design (100%). However, despite having a relatively large number of study design mentions in the training set (38 out of 60), the development and evaluation sets had notably fewer mentions and therefore the precision value should be taken with caution. Similarly, the system retrieved the covariate characteristic with 100% recall, but again the number of annotated covariate concepts was low. The lowest precision was observed for outcomes (79%), while exposures had the lowest recall (80%). With the exception of study design, which saw a small increase (7.7%), recall decreased for the rest of the characteristics when compared to the values on the development set. On the other hand, effect size had a notable increase in precision, from 75% (development) to 97% (evaluation). Overall, the micro F-score, precision and recall for all six epidemiological characteristics were 87%, 88% and 86% respectively, suggesting reliable performance in the identification of epidemiological information from the literature.
Figure 1 The four steps of the approach applied to epidemiological abstracts in order to recognise key characteristics. Linnaeus is used to filter out abstracts not related to humans; dictionary look-up and automatic term recognition (ATR) are applied to identify major medical concepts in text; MinorThird is used as an environment for the rule application and mention identification of epidemiological characteristics.
The rule components in square brackets are the extracted spans that denote the key characteristic; the rest of the rule (if any) specifies the context. The rules use explicit matching of spans (e.g. eq('onset')), regular expressions (re) for matching specific verbs or prepositions (e.g. re('(of|on|in)')), and various vocabularies that contain single-word terms (e.g. a(types), matching words that indicate the conduct of a study (e.g. study, analysis, review)) and multi-word terms (e.g. @st, a vocabulary of epidemiological study designs (e.g. case control)). totals contains words that suggest the participant population; stats is a dictionary that contains numbers and words that express numeric values (e.g., one hundred); clusters includes the different ways in which a population sample can be described (e.g., men, patients, individuals); multiple contains single or multi-word biomedical concepts (e.g., depression, type 2 diabetes); relations is a dictionary with single words that describe an association between concepts (e.g., relationship, link, association); factors contains single or multi-word terms that describe risk factors (e.g., risk factors, predictors); or is a dictionary that contains noun phrases in which the effect size "odds ratio" can be expressed, including the ways in which its numeric value is presented (e.g., odds ratio = 1.34, or = 2.56); ci follows a similar pattern for confidence interval with its assigned numeric value (e.g., 95% ci = 0.91, 95% ci: 4.36, 5.48).
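The per-characteristic and micro-averaged scores reported in the evaluation can be computed as in the minimal Python sketch below. It assumes the standard definitions of precision, recall and F-score from true positive, false positive and false negative counts; the names `Counts`, `prf` and `micro_average` are hypothetical, and the counts in the usage example are invented for illustration and are not the paper's data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Span-level counts for one characteristic (illustrative container)."""
    tp: int  # correctly extracted spans
    fp: int  # extracted spans that were not annotated
    fn: int  # annotated spans that were missed


def prf(c):
    """Return (precision, recall, F-score); 0.0 where a ratio is undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def micro_average(per_type):
    """Pool the counts over all characteristics, then score once."""
    pooled = Counts(
        tp=sum(c.tp for c in per_type.values()),
        fp=sum(c.fp for c in per_type.values()),
        fn=sum(c.fn for c in per_type.values()),
    )
    return prf(pooled)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    example = {
        "outcome": Counts(tp=8, fp=2, fn=2),
        "covariate": Counts(tp=2, fp=0, fn=2),
    }
    for name, counts in example.items():
        print(name, prf(counts))
    print("micro", micro_average(example))
```

Micro-averaging pools the counts before scoring, so characteristics with many mentions (such as outcomes) weigh more heavily than sparse ones (such as covariates).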

Application to the obesity corpus
We applied the system to a large scale corpus consisting of 23,690 epidemiological PubMed abstracts returned by the obesity/epidemiology[mesh] query (restricted to English). We note that a number of returned MEDLINE citations did not contain any abstract, resulting in 19,188 processed citations. In total, we extracted 6,060 mentions of study designs; 13,537 populations; 23,518 exposures; 40,333 outcomes; 5,500 covariates and 9,701 mentions of effect sizes. Table 5 shows the most frequent study types in obesity epidemiological research. The most common epidemiological study designs are cross-sectional (n = 1,940; 32%) and cohort studies (n = 1,876; 31% of all recognized

Discussion
Compared to other approaches that focused specifically on randomized clinical trials, our approach addresses a significantly more diverse literature space. We aimed at extracting key epidemiological characteristics, which are typically more complex than those presented in clinical trials. This is not surprising because clinical trials are subject to strict regulations and are reported in highly standardised ways. Although this makes it difficult to compare our results with those of others directly, we still note that our precision (79-100%) is comparable to other studies (67-93%). The overall F-score of 87% suggests that a rule-based approach can generate reliable results in epidemiological text mining despite the restrained nature of the targeted concepts. Here we discuss several challenges and issues related to epidemiological text mining, and indicate the areas for future work.

Complex and implicit expressions
Despite having relatively reliable annotations (recall the inter-annotator agreement of 80%), epidemiological abstracts feature a number of complex and implicit expressions, with varying levels of detail, that are challenging for text mining. For example, there are various ways in which population can be described: from reporting age, sex and geographical region to mentioning the disease the individuals are currently affected with or that are excluded from the study (e.g. "The study comprised of 52 subjects with histologically confirmed advanced colorectal polyps and 53 healthy controls" [PMID -21235114]). Even more complex are the ways in which exposures are expressed, given that these are often not explicitly stated in text as exposures but are rather part of the context of the study. Similarly, identification of covariate concepts is challenging as only a small number of covariates are explicitly stated in text. Finally, our dictionary coverage and focus were quite limited by design: we focused on biomedical concepts, but other types of concepts may be studied as determinants and outcomes, or be mentioned as covariates (e.g., "high school environmental activity"). While these have been addressed by application of ATR, more generic vocabularies may need to be used (see below for some examples).

Error analysis on the evaluation dataset
Our approach is based on intensive lexical and terminological pre-processing and rules to identify the key epidemiological characteristics. The number of rules designed for obesity can be considered relatively high (412), given that they were engineered from relatively small training (and development) datasets. On one hand, the numbers of rules for study design (16), covariate (28) and effect size (15) were rather small in comparison to others, e.g., population (119), indicating the existence of generic expression patterns that can identify concept types from more generic epidemiological characteristics (such as study design or effect size). However, disease-related concepts often include a variety of determinants along with a number of outcomes of various nature (e.g. anatomical, biological, disease-related, etc.). Therefore, on the other hand, recognizing these epidemiological elements (e.g., outcomes, exposures) through a rule-based approach is not easy and requires a number of rules to accommodate different types of expression. We briefly discuss the cases of errors for each of the characteristics below.

Study design
Due to the limited number of study design mentions (only 13) in the evaluation set, the high values of precision, recall and F-score should be taken with caution. There were no false positives in the evaluation data set. However, it is possible that in a larger dataset, false positives could appear if certain citations report more than one mention of different study types. In addition, study designs without specific information can be ambiguous and thus were ignored (e.g. "Metabolic and bariatric surgery for obesity: a review [False Negative]").
Table note: Frequency is the number of documents, and the last column presents the share within the entire set.

Population
An analysis of false positives reveals that rules relying on the identification of prepositional phrases associated with populations (e.g. among and in) need to require the presence of patient-related concepts more specifically. False negatives included "3,715 deliveries" or "895 veterans who had bariatric surgery", which refer to births and a specific demographic respectively, but our lexical resources did not contain those. Nevertheless, the F-score for the population type was the second best (93%), showing that a rule-based approach can be used to identify the participants in epidemiological studies. An interesting issue arose in the identification of populations associated with meta-analyses. For example, the mention "included 3 studies involving 127 children" was identified by patterns but it is clear that a specific approach would be needed for meta-analysis studies.

Exposures and outcomes
While outcomes are often explicitly mentioned in text as such, exposure concepts are not, which makes the identification of exposures a particularly challenging task. Still, the use of dictionaries containing biomedical concepts for identification of potential mentions proved useful for capturing exposure concepts. However, dictionary-based look-up also contributed to incorrect exposure candidates that were extracted from non-relevant contexts. On the other hand, two frequent causes of errors could be linked to missing concepts from our dictionaries (e.g. "late bedtimes" or "costs") and relatively complex exposure expressions (e.g. "level of PA during leisure"). An important source of errors was the confusion between exposures and outcomes, given they both refer to similar (semantic) types whose instances can, in different studies, be either exposure or outcome, and thus their role can be easily misinterpreted as an outcome rather than a studied determinant (and vice versa). We noted that rules such as "association between <exposure> and <outcome>" or "<exposure> associated with <outcome>" generated encouraging results, i.e., a number of TPs. This was not surprising: when clinical professionals study the relationship between two concepts, they explore the link between an exposure and an outcome, which the above patterns capture. Still, sometimes these patterns would match links irrelevant to exposure/outcome relationships (e.g. "relationship between race and gender"). Cases like these result in the generation of both false positives and false negatives. Overall, a sentence-focused rule-based method may struggle to understand a concept's role in a given case, and a wider context might need to be considered.
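The two association patterns mentioned above can be sketched in Python as follows. This is a minimal illustration rather than the system's rules: the `CONCEPTS` and `RELATIONS` lists are small hypothetical stand-ins for the biomedical concept and relation dictionaries, and `extract_pairs` is an assumed helper name. Roles are assigned purely by position in the pattern, which is why such rules can confuse exposures and outcomes.

```python
import re

# Hypothetical stand-ins for the biomedical concept and relation dictionaries.
CONCEPTS = ["physical activity", "sleep duration", "type 2 diabetes",
            "depression", "obesity"]
RELATIONS = ["association", "relationship", "link"]


def alternation(terms):
    """Regex alternation with longer terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"


CONCEPT = alternation(CONCEPTS)

PATTERNS = [
    # "association between <exposure> and <outcome>"
    re.compile(
        rf"\b{alternation(RELATIONS)}\s+between\s+"
        rf"(?P<exposure>{CONCEPT})\s+and\s+(?P<outcome>{CONCEPT})\b"
    ),
    # "<exposure> (is|was|...) associated with <outcome>"
    re.compile(
        rf"\b(?P<exposure>{CONCEPT})\s+(?:(?:is|are|was|were)\s+)?"
        rf"associated\s+with\s+(?P<outcome>{CONCEPT})\b"
    ),
]


def extract_pairs(text):
    """Return (exposure, outcome) pairs; roles follow pattern position only."""
    lowered = text.lower()
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(lowered):
            pairs.append((match.group("exposure"), match.group("outcome")))
    return pairs


if __name__ == "__main__":
    print(extract_pairs("We examined the association between "
                        "physical activity and depression."))
    print(extract_pairs("Short sleep duration was associated with obesity."))
```

In this sketch "relationship between race and gender" yields no pair only because neither term is in the toy concept list; with a broader dictionary the same pattern would match it, which is the false-positive case described above.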

Covariates
Covariates had only a limited number of identified spans, hence any conclusion regarding the system's performance is at most indicative. Still, the results could provide an initial indication that (at least explicit) covariate mentions could be detected with good accuracy, despite some false positives (e.g. a generic mention "potential confounders" was identified as a covariate in "… after adjustment for potential confounders").

Effect size
The rules designed to recognize effect size spans were based on the combination of numerical and specific lexical expressions (e.g. "relative risk", "confidence interval"). The relatively high recall (87%), together with high precision, indicates that this approach returned promising results, with only a small number of mentions being missed by the system. False negatives included expressions that contained multiple values (e.g., "… increased risks of overweight/obesity at the age of 4 years (odds ratio (95% confidence interval): 15.01 (9.63, 23.38))", "… bmi statistically significantly increased by 2.8% (95% confidence interval: 1.5% to 4.1%; p < 0.001) …").
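As an illustration of combining numerical and lexical expressions, the following Python sketch matches simple effect size and confidence interval mentions. The regular expressions, the list of measure names and the function `extract_effect_sizes` are assumptions made for this example and do not reproduce the system's fifteen rules; the sketch deliberately stays simple, so the combined form "odds ratio (95% confidence interval): 15.01 (9.63, 23.38)" is missed, much like the false negatives described above.

```python
import re

NUM = r"\d+(?:\.\d+)?"

# Measure name, then "=" or ":", then a value (e.g. "odds ratio = 1.34").
# The separator is required so that the word "or" alone does not match.
EFFECT = re.compile(
    rf"\b(?P<measure>odds ratio|relative risk|hazard ratio|or|rr|hr)"
    rf"\s*[=:]\s*(?P<value>{NUM})",
    re.IGNORECASE,
)

# Confidence level, label, optional separator, then two bounds
# (e.g. "95% CI: 4.36, 5.48" or "95% confidence interval: 1.5% to 4.1%").
INTERVAL = re.compile(
    rf"\b(?P<level>\d{{1,2}})%\s*(?:ci|confidence interval)\s*[=:]?\s*\(?\s*"
    rf"(?P<low>{NUM})%?\s*(?:,|-|to)\s*(?P<high>{NUM})",
    re.IGNORECASE,
)


def extract_effect_sizes(text):
    """Return (effects, intervals) found by the two illustrative patterns."""
    effects = [(m.group("measure").lower(), float(m.group("value")))
               for m in EFFECT.finditer(text)]
    intervals = [(int(m.group("level")), float(m.group("low")),
                  float(m.group("high")))
                 for m in INTERVAL.finditer(text)]
    return effects, intervals


if __name__ == "__main__":
    print(extract_effect_sizes("odds ratio = 1.34 (95% CI: 1.10, 1.62)"))
    print(extract_effect_sizes(
        "odds ratio (95% confidence interval): 15.01 (9.63, 23.38)"))
```

Covering the multi-value forms would need additional patterns that pair the measure name with a value appearing after the parenthesised interval label.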

Application to the obesity corpus
Although we had relatively good recall in both the development and evaluation datasets, the experiments with the entire obesity dataset have shown that the system extracted epidemiological information only from a limited number of documents. We have therefore explored the reasons for that.

Study design
We identified study type from only around 40% of processed articles (each tagged as obesity/epidemiology). To explore whether those missed study design mentions are due to our incomplete dictionaries and rules, we inspected 20 randomly selected articles from those that contained no identified study type, and we identified the following possible reasons:
No mention of study design: while the article presents an epidemiological context, no specific epidemiological study had been conducted (and thus there was no need to specify study design); this was the case in almost 2/3 of the abstracts with no study design;
Summarised epidemiological studies: articles summarizing epidemiological information but without reporting a specific conducted study and its findings (15% of the abstracts);
Other study designs: studies including comparative studies, surveys, pilot studies, follow-up studies, reports and reviews that were not targeted for identification (20% of the abstracts).
We note that we can see a similar pattern in the evaluation dataset (which was randomly selected from the obesity corpus). Importantly, for the majority of abstracts in the evaluation dataset, if the system was able to detect the study type, all other epidemiological

Covariates
Only 5,500 confounding factors were recognised. To explore the reason for so many articles not having covariates extracted, a random sample of 20 abstracts in which no covariate concept was identified was investigated. None of the studied abstracts contained any covariate mentions. Most abstracts used only generic expressions (e.g., "after adjustment for confounding factors", "after controlling for covariates") without specifying the respective concepts. We note that we only processed abstracts and it seems likely that covariates may be defined in full-text articles.

Effect size
Similar observations to the ones made for the covariate characteristic were noted for the effect size mentions (only 9,701 mentions were extracted). We explored a sample of 20 abstracts in which no effect size was recognised. As many as 60% of the abstracts did not report any observed effect size between the studied exposures and outcomes due to the nature of the conducted study (e.g. pilot study, systematic review, article). We failed, however, to get effect size mentions in 40% of cases, mainly because of mentions that contained coordinated expressions (e.g. "The prevalence of hypertension was considerably higher among men than among women (60.3% and 44.6%, respectively)"; PMID 18791341) or statistical significance data, which are not covered by our rules.

Outcomes
As opposed to other characteristics, the number of recognised outcome concepts was more than double the number of abstracts. This is not a surprise, as most of the epidemiological studies include more than one outcome of interest. In addition, with the current system, we have not attempted to unify synonymous terms (unless they are simple orthographic variants).

Conclusions
We presented a generic rule based approach for the extraction of the six key characteristics (study design, population, exposure(s), outcome(s), covariate(s) and effect size) from epidemiological abstracts. The evaluation process revealed promising results with the F-score ranging between 82% and 96%, suggesting that automatic extraction of epidemiological elements from abstracts could be useful for mining key study characteristics and possible meta-analysis or systematic reviews. Also, extracted profiles can be used for identification of gaps and knowledge modelling of complex health problems. Although our experiments focused on obesity mainly for the purpose of evaluation, the suggested approach of identifying key epidemiological characteristics related to a particular clinical health problem is generic. Our current work does not include identification of synonymous expressions or more detailed mapping of identified terms to existing knowledge repositories, which would allow direct integration of the literature with other clinical resources. This will be the topic for our future work. Another potential limitation of the current work is that we focused only on abstracts, rather than full-text articles. It would be interesting to explore if full-text would improve the identification (in particular recall) or it would introduce more noise (reducing precision).

Availability and requirements
Project name: EpiTeM (Epidemiological Text Mining)
Project home page: http://gnode1.mib.man.ac.uk/epidemiology/
Operating system(s): Platform independent
Programming language: Python
Other requirements: MinorThird
License: FreeBSD
Any restrictions to use by non-academics: None