
Mining characteristics of epidemiological studies from Medline: a case study in obesity



The health sciences literature incorporates a relatively large subset of epidemiological studies that focus on population-level findings, including various determinants, outcomes and correlations. Extracting structured information about those characteristics would be useful for more complete understanding of diseases and for meta-analyses and systematic reviews.


We present an information extraction approach that enables users to identify key characteristics of epidemiological studies from MEDLINE abstracts. It extracts six types of epidemiological characteristic: design of the study, population that has been studied, exposure, outcome, covariates and effect size. We have developed a generic rule-based approach that has been designed according to semantic patterns observed in text, and tested it in the domain of obesity. Identified exposure, outcome and covariate concepts are clustered into health-related groups of interest. On a manually annotated test corpus of 60 epidemiological abstracts, the system achieved precision, recall and F-score between 79-100%, 80-100% and 82-96% respectively. We report the results of applying the method to a large scale epidemiological corpus related to obesity.


The experiments suggest that the proposed approach could identify key epidemiological characteristics associated with a complex clinical problem from related abstracts. When integrated over the literature, the extracted data can be used to provide a more complete picture of epidemiological efforts, and thus support understanding via meta-analysis and systematic reviews.


Epidemiological studies aim to discover the patterns and determinants of diseases and other health-related states by studying the health of populations in standardised ways. They are valuable sources of evidence for public health measures and for shaping research questions in the clinical and biological aspects of complex diseases. Nevertheless, the increasing amount of published literature leads to information overload, making the task of reading and integrating relevant knowledge a challenging process [1–3]. For example, there are more than 23,000 obesity-related articles reporting on different epidemiological findings, including almost 3,000 articles with obesity/epidemiology as a MeSH descriptor in 2012, with more than 15,000 such articles in the last 10 years. Therefore, there is a need for systems that extract salient epidemiological study features, so that investigators spend less time detecting, summarising and incorporating epidemiological information from the relevant literature [4].

Epidemiology is a relatively structured field with its own dictionary and reporting style, deliberately written in a typical semi-structured format in order to standardize and improve study design, communication and collaboration. The standard characteristics in most epidemiological studies include [5]:

  • study design - a specific plan or protocol that has been followed in the conduct of the study;

  • population - demographic details of the individuals (e.g., gender, age, ethnicity, nationality) participating in an epidemiological study;

  • exposure - a factor, event, characteristic or other definable entity that brings about change in a health condition or in other defined characteristics;

  • outcome - the consequence from the exposure in the population of interest;

  • covariate - a concept that is possibly predictive of the outcome under study;

  • effect size - the measure of the strength of the relationship between variables, that relates outcomes to exposures in the population of interest.

In this paper we present a system that enables the identification and retrieval of the key characteristics from epidemiological studies. We have applied the system to the obesity epidemiological literature. Obesity is one of the most important health problems of the 21st century [6], presenting a great public health and economic challenge [7–9]. The rapid and worldwide spread of obesity has affected people of all ages, genders, geographies and ethnicities. It has been regarded as a multi-dimensional disorder [10], with major behavioural and environmental determinants, with genetics playing only a minor role [7].

Related work

In the last decade, a significant amount of research has been performed on the extraction of information in the biomedical field, especially on the identification of biological [11, 12] and clinical concepts [13, 14] in the literature. In clinical text mining, several attempts have been made to extract various kinds of information from case studies and clinical trials in particular [1–4, 15–23]. For example, De Bruijn et al. [22] applied text classification with a “weak” regular expression matcher on randomized clinical trial (RCT) reports for the recognition of key trial information that included 23 characteristics (e.g. eligibility criteria, sample size, route of treatment, etc.) with overall precision of 75%. The system was further expanded to identify and extract specific characteristics such as primary outcome names and names of experimental treatment from journal articles reporting RCTs [4], with precision of 93%. However, they focused solely on RCTs and especially on randomized controlled drug treatment trials. Hara and Matsumoto [1] extracted information about the design of phase III clinical trials. They extracted the patient population and the compared treatments through noun phrase chunking and categorisation along with regular expression pattern matching, reporting precision of 80% for population and 82% for compared treatments. Hansen et al. [2] worked on RCTs, identifying the number of trial participants through a support vector machine algorithm with 97% precision, while Fiszman et al. [19] aimed to recognize metabolic syndrome risk factors in MEDLINE citations through automatic semantic interpretation with 67% precision. However, to the best of our knowledge, there is no approach available for recognising key information elements from various types of epidemiological studies that are related to a particular health problem.


Methods

Our approach involved the design and implementation of generic rule-based patterns, which identify mentions of particular characteristics of epidemiological studies in PubMed abstracts (Figure 1). The rules are based on patterns that were engineered from a sample of 60 epidemiological abstracts in the domain of obesity. Mentions of six semantic types (study design, population, exposures, outcomes, covariates and effect size) have been manually identified and reviewed. Additionally, a development set of 30 further abstracts was used to optimise the performance of the rules. These steps are explained here in more detail.

  1. Abstract selection and species filtering. In the first step, abstracts are retrieved from PubMed using specific MeSH terms (e.g. obesity/epidemiology[mesh]). They are checked by LINNAEUS, a species identification system [24], to filter out studies based on non-human species.

  2. Building of dictionaries of potential mentions. In the second step, a number of semantic classes are identified using custom-made vocabularies that include terms to detect key characteristics in epidemiological study abstracts (e.g. dictionaries of words that indicate study design, population totals, etc.; fourteen dictionaries in total). We also identify mentions of Unified Medical Language System (UMLS) [25] terms and additionally apply the SPECIALIST Lexicon [26] in order to extract potential exposure, outcome, covariate and population concepts. Finally, epidemiological abstracts are processed with an automatic term recognition (ATR) method for the extraction of multi-word candidate concepts and their variants [27, 28]. Filtering against a common stop-word list (created by Fox [29]) is applied to remove any concepts of non-biomedical nature.

  3. Mention-level application of rules. In the third step, rules are applied to the abstracts for each of the six epidemiological characteristics separately. The rules make use of two constituent types: frozen lexical expressions (used as anchors for specific categories) and specific semantic classes identified through the vocabularies (identified in step 2), which are combined using regular expressions. The frozen lexical expressions can contain particular verbs, prepositions or certain nouns. Table 1 shows the number of rules created for each of the six characteristics with some typical examples. As a result of the application of rules, candidate mentions of epidemiological concepts are tagged in text. We used MinorThird [30] for annotating and recognizing entities of interest.

  4. Document-level unification. Finally, in cases where several candidate mentions for a single epidemiological characteristic were recognised in a given document, we also ‘unified’ them to get document-level annotations using the following approach: if a given mention is part of a longer mention, then we select only the longer one. Mentions that are not included in other mentions (of the same type) are also returned. In addition, where applicable (i.e. for exposures, outcomes and covariates), these mentions are mapped to one of the 15 UMLS semantic groups (Activities and Behaviors, Anatomy, Chemicals and Drugs, Concepts and Ideas, Devices, Disorders, Genes and Molecular Sequences, Geographic Areas, Living Beings, Objects, Occupations, Organizations, Phenomena, Physiology and Procedures). We decided to perform the mapping to high-level UMLS semantic groups to assist epidemiologists in the application of an ‘epidemiological sieve’, which could help them decide whether or not to include abstracts for more detailed inspection. For example, highlighting different types of determinant (e.g. demographic vs. lifestyle) would be useful for considering the completeness and relevance of factors in a particular study by emphasizing possible connections between the background of the exposure and/or the outcomes.
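The containment rule used for document-level unification can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the `Mention` tuple, the `unify` function and the use of character offsets to decide whether one mention is "part of" another are all hypothetical choices introduced here.

```python
from typing import List, NamedTuple, Optional


class Mention(NamedTuple):
    kind: str   # characteristic type, e.g. "exposure" or "outcome"
    start: int  # character offset where the mention begins
    end: int    # character offset just past the end of the mention
    text: str


def unify(mentions: List[Mention]) -> List[Mention]:
    """Drop every mention that lies inside a longer mention of the same kind.

    Assumption: "part of a longer mention" is read as span containment.
    """
    # Remove exact duplicates and sort so that output order is deterministic.
    unique = sorted(set(mentions), key=lambda m: (m.kind, m.start, -m.end))
    kept = []
    for m in unique:
        contained = any(
            o.kind == m.kind
            and o.start <= m.start
            and m.end <= o.end
            and (o.end - o.start) > (m.end - m.start)
            for o in unique
        )
        if not contained:
            kept.append(m)
    return kept


if __name__ == "__main__":
    candidates = [
        Mention("outcome", 0, 3, "BMI"),
        Mention("outcome", 0, 12, "BMI increase"),
        Mention("exposure", 0, 3, "BMI"),
    ]
    for mention in unify(candidates):
        print(mention.kind, mention.text)
```

Because containment is only checked between mentions of the same kind, a short exposure mention survives even when a longer outcome mention covers the same span.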

Figure 1 The four steps of the approach applied to epidemiological abstracts in order to recognise key characteristics. LINNAEUS is used to filter out abstracts not related to humans; dictionary look-up and automatic term recognition (ATR) are applied to identify major medical concepts in text; MinorThird is used as an environment for the rule application and mention identification of epidemiological characteristics.

Table 1 Examples of rules for recognition of study design, population, exposure, outcome, covariate and effect size in epidemiological abstracts
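To illustrate how a frozen lexical expression and a dictionary-derived semantic class can be combined in a regular expression, the following Python sketch builds a single population rule. The anchors, the vocabulary and the function names are hypothetical stand-ins rather than the rules summarised in Table 1, and the system itself applies its rules within MinorThird rather than with Python's `re` module.

```python
import re

# Hypothetical mini-vocabulary standing in for one semantic-class dictionary.
POPULATION_TERMS = [
    "participants", "adolescents", "subjects", "children",
    "patients", "adults", "women", "men",
]

# Hypothetical frozen lexical expressions used as anchors.
ANCHORS = ["comprised of", "included", "involving", "among"]


def build_population_rule(anchors, terms):
    """Compile: <anchor> <number> [up to three filler words] <population term>."""
    anchor_alt = "|".join(re.escape(a) for a in anchors)
    # Longest terms first so that "women" is tried before "men".
    term_alt = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.compile(
        rf"\b(?:{anchor_alt})\s+"
        rf"(?P<mention>\d[\d,]*\s+(?:\w+\s+){{0,3}}?(?:{term_alt}))\b",
        re.IGNORECASE,
    )


def find_population_mentions(text, rule=None):
    """Return the text of every candidate population mention in `text`."""
    rule = rule or build_population_rule(ANCHORS, POPULATION_TERMS)
    return [m.group("mention") for m in rule.finditer(text)]


if __name__ == "__main__":
    sentence = (
        "The study comprised of 52 subjects with histologically confirmed "
        "advanced colorectal polyps and 53 healthy controls"
    )
    print(find_population_mentions(sentence))
```

Keeping the anchors and the vocabulary as separate lists mirrors the split between frozen expressions and semantic classes: extending a dictionary changes what the rule can match without touching the pattern itself.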



Results

We evaluated the system’s performance at the document level by considering whether selected spans were correctly marked in text. We calculated precision, recall and F-score for each characteristic of interest using the standard definitions [31]. In order to create an evaluation dataset, 60 abstracts were randomly selected from the PubMed results obtained by the query obesity/epidemiology[mesh] and manually double-annotated for all six epidemiological characteristics by the first author and an external curator with epidemiological expertise. The inter-annotator agreement of 80% was calculated on the evaluation dataset by the absolute agreement rate [32], suggesting relatively reliable annotations.

Table 2 shows the results on the evaluation set, with the results obtained on the training and development sets given for comparison (Tables 3 and 4). The precision and recall values ranged from 79% to 100% and 80% to 100%, with F-measures between 82% and 96%. The best precision was observed for study design (100%). However, despite having a relatively large number of study design mentions in the training set (38 out of 60), the development and evaluation sets had notably fewer mentions and therefore the precision value should be taken with caution. Similarly, the system retrieved the covariate characteristic with 100% recall, but again the number of annotated covariate concepts was low. The lowest precision was observed for outcomes (79%), while exposures had the lowest recall (80%). With the exception of study design, which saw a small increase (7.7%), recall decreased for the rest of the characteristics when compared to the values on the development set. On the other hand, effect size had a notable increase in precision, from 75% (development) to 97% (evaluation). Overall, the micro F-score, precision and recall for all six epidemiological characteristics were 87%, 88% and 86% respectively, suggesting reliable performance in the identification of epidemiological information from the literature.
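For reference, the per-characteristic and micro-averaged scores follow from the TP, FP and FN counts as sketched below. The function names and the counts in the example are hypothetical and are not the values reported in Tables 2, 3 and 4.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def micro_average(counts):
    """Pool the counts of all characteristics before computing the scores.

    `counts` maps a characteristic name to a (TP, FP, FN) tuple.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    example = {"population": (40, 3, 4), "effect size": (30, 1, 5)}
    for name, (tp, fp, fn) in example.items():
        print(name, prf(tp, fp, fn))
    print("micro", micro_average(example))
```

Micro-averaging pools the counts, so characteristics with many annotated mentions (such as outcomes) weigh more heavily in the overall score than sparsely annotated ones (such as covariates).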

Table 2 Results, including true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-score on the evaluation set
Table 3 Results, including true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-score on the training set
Table 4 Results, including true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-score on the development set

Application to the obesity corpus

We applied the system to a large-scale corpus consisting of 23,690 epidemiological PubMed abstracts returned by the obesity/epidemiology[mesh] query (restricted to English). We note that a number of returned MEDLINE citations did not contain any abstract, resulting in 19,188 processed citations. In total, we extracted 6,060 mentions of study designs; 13,537 populations; 23,518 exposures; 40,333 outcomes; 5,500 covariates and 9,701 mentions of effect sizes.

Table 5 shows the most frequent study types in obesity epidemiological research. The most common epidemiological study designs are cross-sectional (n = 1,940; 32%) and cohort studies (n = 1,876; 31% of all recognized studies), whereas there were only 109 (1.7%) randomized clinical trials. Tables 6, 7, 8, 9, 10 and 11 present the most frequent exposures, outcomes and covariates along with their UMLS semantic types.

Table 5 The most frequent study designs extracted from the obesity epidemiological literature
Table 6 The most frequent exposures extracted from the obesity epidemiological literature
Table 7 Distribution of UMLS semantic groups assigned to exposures
Table 8 The most frequent outcomes extracted from the obesity epidemiological literature
Table 9 Distribution of UMLS semantic groups assigned to outcomes
Table 10 The most frequent covariates extracted from the obesity epidemiological literature
Table 11 Distribution of UMLS semantic groups assigned to covariates


Discussion

Compared to other approaches that focused specifically on randomized clinical trials, our approach addresses a significantly more diverse literature space. We aimed at extracting key epidemiological characteristics, which are typically more complex than those presented in clinical trials. This is not surprising because clinical trials are subject to strict regulations and are reported in highly standardised ways. Although this makes it difficult to compare our results with those of others directly, we still note that our precision (79-100%) is comparable to other studies (67-93%). The overall F-score of 87% suggests that a rule-based approach can generate reliable results in epidemiological text mining despite the restrained nature of the targeted concepts. Here we discuss several challenges and issues related to epidemiological text mining, and indicate the areas for future work.

Complex and implicit expressions

Despite having relatively reliable annotations (recall the inter-annotator agreement of 80%), epidemiological abstracts feature a number of complex and implicit expressions, with varying levels of detail, that are challenging for text mining. For example, there are various ways in which a population can be described: from reporting age, sex and geographical region to mentioning the disease that the individuals are currently affected by or that excludes them from the study (e.g. “The study comprised of 52 subjects with histologically confirmed advanced colorectal polyps and 53 healthy controls”; PMID 21235114). Even more complex are the ways in which exposures are expressed, given that these are often not explicitly stated in text as exposures but are rather part of the context of the study. Similarly, identification of covariate concepts is challenging as only a small number of covariates are explicitly stated in text.

Finally, our dictionary coverage and focus were quite limited by design: we focused on biomedical concepts, but other types of concepts may be studied as determinants and outcomes, or be mentioned as covariates (e.g., “high school environmental activity”). While these have been addressed by application of ATR, more generic vocabularies may need to be used (see below for some examples).

Error analysis on the evaluation dataset

Our approach is based on intensive lexical and terminological pre-processing and rules to identify the key epidemiological characteristics. The number of rules designed for obesity can be considered relatively high (412), given that they were engineered from relatively small training (and development) datasets. On the one hand, the number of rules for study design (16), covariate (28) and effect size (15) was rather small in comparison to others, e.g. population (119), indicating the existence of generic expression patterns that can identify concept types from more generic epidemiological characteristics (such as study design or effect size). However, disease-related concepts often include a variety of determinants along with a number of outcomes of various nature (e.g. anatomical, biological, disease-related, etc.). Therefore, on the other hand, recognizing these epidemiological elements (e.g., outcomes, exposures) through a rule-based approach is not easy and requires a number of rules to accommodate different types of expression. We briefly discuss the causes of errors for each of the characteristics below.

Study design

Due to the limited number of study design mentions (only 13) in the evaluation set, the high values of precision, recall and F-score should be taken with caution. There were no false positives in the evaluation data set. However, it is possible that in a larger dataset, false positives could appear if certain citations report more than one mention of different study types. In addition, study designs without specific information can be ambiguous and thus were ignored (e.g. “Metabolic and bariatric surgery for obesity: a review [False Negative]”).


Population

An analysis of false positives reveals that rules relying on the identification of prepositional phrases associated with populations (e.g. among and in) need to require the explicit presence of patient-related concepts. False negatives included “3,715 deliveries” and “895 veterans who had bariatric surgery”, which refer to births and a specific demographic respectively, but our lexical resources did not contain those. Nevertheless, the F-score for the population type was the second best (93%), showing that a rule-based approach can be used to identify the participants in epidemiological studies. An interesting issue arose in the identification of populations associated with meta-analyses. For example, the mention “included 3 studies involving 127 children” was identified by patterns, but it is clear that a specific approach would be needed for meta-analysis studies.

Exposures and outcomes

While outcomes are often explicitly mentioned in text as such, exposure concepts are not, which makes the identification of exposures a particularly challenging task. Still, the use of dictionaries containing biomedical concepts for identification of potential mentions proved useful for capturing exposure concepts. However, dictionary-based look-up also contributed to incorrect exposure candidates that were extracted from non-relevant contexts. On the other hand, two frequent causes of errors could be linked to missing concepts from our dictionaries (e.g. “late bedtimes” or “costs”) and relatively complex exposure expressions (e.g. “level of PA during leisure”).

An important source of errors was the confusion between exposures and outcomes: both refer to similar (semantic) types whose instances can, in different studies, be either an exposure or an outcome, so a concept's role can easily be misinterpreted as an outcome rather than a studied determinant (and vice versa). We noted that rules such as “association between <exposure> and <outcome>” or “<exposure> associated with <outcome>” generated encouraging results, i.e. a number of TPs. This was not surprising: when clinical professionals study the relationship between two concepts, they explore the link between an exposure and an outcome, which the above patterns capture. Still, sometimes these patterns would match links irrelevant to exposure/outcome relationships (e.g. “relationship between race and gender”). Cases like these result in the generation of both false positives and false negatives. Overall, a sentence-focused rule-based method may struggle to understand a concept’s role in a given case, and a wider context might need to be considered.
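The two association patterns quoted above, and the kind of false positive they can produce, can be sketched as follows. The concept list, the rule names and the `extract_roles` function are hypothetical and only stand in for the dictionary look-up and rules of the actual system.

```python
import re

# Hypothetical concept list standing in for dictionary/UMLS look-up results.
CONCEPTS = [
    "physical activity", "body mass index", "hypertension",
    "obesity", "gender", "race",
]

# Longest concepts first so that multi-word terms win over their substrings.
_concept_alt = "|".join(
    re.escape(c) for c in sorted(CONCEPTS, key=len, reverse=True)
)

# "association between <exposure> and <outcome>"
BETWEEN_RULE = re.compile(
    rf"\b(?:association|relationship)\s+between\s+"
    rf"(?P<exposure>{_concept_alt})\s+and\s+(?P<outcome>{_concept_alt})\b",
    re.IGNORECASE,
)

# "<exposure> (was|were|is|are) associated with <outcome>"
ASSOCIATED_RULE = re.compile(
    rf"\b(?P<exposure>{_concept_alt})\s+(?:was|were|is|are)?\s*"
    rf"associated\s+with\s+(?P<outcome>{_concept_alt})\b",
    re.IGNORECASE,
)


def extract_roles(sentence):
    """Return (exposure, outcome) pairs proposed by the two rules."""
    pairs = []
    for rule in (BETWEEN_RULE, ASSOCIATED_RULE):
        for m in rule.finditer(sentence):
            pairs.append((m.group("exposure").lower(), m.group("outcome").lower()))
    return pairs


if __name__ == "__main__":
    print(extract_roles("We examined the association between physical activity and obesity."))
    # The same surface pattern also fires on a link that is not an
    # exposure/outcome relationship, which would count as a false positive.
    print(extract_roles("We explored the relationship between race and gender."))
```

The roles are assigned purely by slot position within one sentence, which is why a pattern of this kind cannot tell a studied determinant from an unrelated pair of concepts without wider context.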


Covariates

Covariates had only a limited number of identified spans, hence any conclusion regarding the system’s performance is at best indicative. Still, the results could provide an initial indication that (at least explicit) covariate mentions could be detected with good accuracy, despite some false positives (e.g. a generic mention “potential confounders” was identified as a covariate in “… after adjustment for potential confounders”).

Effect size

The rules designed to recognize effect size spans were based on the combination of numerical and specific lexical expressions (e.g. “relative risk”, “confidence interval”). The relatively high recall (87%), obtained together with high precision, indicates that this approach is promising: only a small number of mentions were missed by the system. False negatives were typically expressions that contained multiple values (e.g., “… increased risks of overweight/obesity at the age of 4 years (odds ratio (95% confidence interval): 15.01 (9.63, 23.38))”, “… bmi statistically significantly increased by 2.8% (95% confidence interval: 1.5% to 4.1%; p < 0.001) …”).
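A rule that combines a lexical anchor with numerical expressions might look like the Python sketch below. The term list, the pattern and the `extract_effect_sizes` function are assumptions made for illustration and do not reproduce the 15 effect size rules of the system.

```python
import re

# Hypothetical lexical anchors for effect size measures.
EFFECT_TERMS = ["odds ratio", "relative risk", "hazard ratio"]

NUMBER = r"\d+(?:\.\d+)?"
_term_alt = "|".join(re.escape(t) for t in EFFECT_TERMS)

# <measure> [=|:|of|was] <value> [ (95% CI|confidence interval: <low>-<high>) ]
EFFECT_RULE = re.compile(
    rf"\b(?P<measure>{_term_alt})\b\s*(?:=|:|of|was)?\s*(?P<value>{NUMBER})"
    rf"(?:\s*\(\s*95%\s*(?:confidence interval|CI)\s*[:=]?\s*"
    rf"(?P<low>{NUMBER})\s*(?:-|to|,)\s*(?P<high>{NUMBER})\s*\))?",
    re.IGNORECASE,
)


def extract_effect_sizes(text):
    """Return (measure, value, ci_low, ci_high) tuples; CI bounds may be None."""
    results = []
    for m in EFFECT_RULE.finditer(text):
        low, high = m.group("low"), m.group("high")
        results.append((
            m.group("measure").lower(),
            float(m.group("value")),
            float(low) if low else None,
            float(high) if high else None,
        ))
    return results


if __name__ == "__main__":
    print(extract_effect_sizes("the odds ratio was 1.45 (95% CI: 1.10-1.92)"))
    # The multi-value phrasing quoted above is not covered by this rule.
    print(extract_effect_sizes(
        "increased risks of overweight/obesity at the age of 4 years "
        "(odds ratio (95% confidence interval): 15.01 (9.63, 23.38))"
    ))
```

The second call returns an empty list because the value does not directly follow the measure name, which is the same kind of miss as the false negatives described above.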

Application to the obesity corpus

Although we had relatively good recall in both the development and evaluation datasets, the experiments with the entire obesity dataset have shown that the system extracted epidemiological information only from a limited number of documents. We have therefore explored the reasons for that.

Study design

We identified study type from only around 40% of processed articles (each tagged as obesity/epidemiology). To explore whether those missed study design mentions are due to our incomplete dictionaries and rules, we inspected 20 randomly selected articles from those that contained no identified study type, and we identified the following possible reasons:

  • No mention of study design: while the article presents an epidemiological context, no specific epidemiological study had been conducted (and thus there was no need to specify study design) – this was the case in almost 2/3 of the abstracts with no study design;

  • Summarised epidemiological studies: articles summarizing epidemiological information but without reporting a specific conducted study and its findings (15% of the abstracts);

  • Other study designs: studies including comparative studies, surveys, pilot studies, follow-up studies, reports, reviews that were not targeted for identification (20% of the abstracts).

We note that we can see a similar pattern in the evaluation dataset (which was randomly selected from the obesity corpus). Importantly, for the majority of abstracts in the evaluation dataset, if the system was able to detect the study type, all other epidemiological characteristics have been extracted with relative success, providing a complete profile of an epidemiological study (data not shown).


Covariates

Only 5,500 confounding factors were recognised. To explore why so many articles had no covariates extracted, we investigated a random sample of 20 abstracts in which no covariate concept was identified. None of the studied abstracts contained any covariate mentions. Most abstracts used only generic expressions (e.g., “after adjustment for confounding factors”, “after controlling for covariates”) without specifying the respective concepts. We note that we only processed abstracts, and it seems likely that covariates are defined in the full-text articles.

Effect size

Similar observations to the ones made for the covariate characteristic were noted for the effect size mentions (only 9,701 mentions were extracted). We explored a sample of 20 abstracts in which no effect size was recognised. As many as 60% of the abstracts did not report any observed effect size between the studied exposures and outcomes due to the nature of the conducted study (e.g. pilot study, systematic review, article). We failed, however, to get effect size mentions in 40% of cases, mainly because of mentions that contained coordinated expressions (e.g. “The prevalence of hypertension was considerably higher among men than among women (60.3% and 44.6%, respectively)”; PMID 18791341) or statistical significance data, which are not covered by our rules.


Outcomes

As opposed to other characteristics, the number of recognised outcome concepts was more than double the number of abstracts. This is not a surprise, as most epidemiological studies include more than one outcome of interest. In addition, with the current system, we have not attempted to unify synonymous terms (unless they are simple orthographic variants).


Conclusions

We presented a generic rule-based approach for the extraction of the six key characteristics (study design, population, exposure(s), outcome(s), covariate(s) and effect size) from epidemiological abstracts. The evaluation process revealed promising results with the F-score ranging between 82% and 96%, suggesting that automatic extraction of epidemiological elements from abstracts could be useful for mining key study characteristics and for supporting meta-analyses or systematic reviews. Also, extracted profiles can be used for identification of gaps and knowledge modelling of complex health problems. Although our experiments focused on obesity mainly for the purpose of evaluation, the suggested approach of identifying key epidemiological characteristics related to a particular clinical health problem is generic.

Our current work does not include identification of synonymous expressions or more detailed mapping of identified terms to existing knowledge repositories, which would allow direct integration of the literature with other clinical resources. This will be the topic of our future work. Another potential limitation of the current work is that we focused only on abstracts, rather than full-text articles. It would be interesting to explore whether full text would improve the identification (in particular recall) or introduce more noise (reducing precision).

Availability and requirements

Project name: EpiTeM (Epidemiological Text Mining)

Project home page:

Operating system(s): Platform independent

Programming language: Python

Other requirements: MinorThird

License: FreeBSD

Any restrictions to use by non-academics: None



Abbreviations

ATR: Automatic term recognition

FN: False negatives

FP: False positives

RCT: Randomized clinical trial

TP: True positives

UMLS: Unified Medical Language System.


  1. Hara K, Matsumoto Y: Extracting clinical trial design information from MEDLINE abstracts. N Gener Comput. 2007, 25: 263-275. 10.1007/s00354-007-0017-5.

    Article  Google Scholar 

  2. Hansen JM, Rasmussen ON, Chung G: A Method for Extracting the Number of Trial Participants from Abstracts of Randomized Controlled Trials. J Telemed Telecare. 2008, 14 (7): 354-358. 10.1258/jtt.2008.007007.

    Article  Google Scholar 

  3. Chung YG: Sentence retrieval for abstracts of randomized controlled trials. BMC Med Informat Decis Making. 2009, 9: 10-10.1186/1472-6947-9-10. doi:10.1186/1472-6947-9-10

    Article  Google Scholar 

  4. Kiritchenko S, De Bruijn B, Carini S, Martin J, Sim I: ExaCT: Automatic extraction of clinical trial characteristics from Journal Publications. BMC Med Informat Decis Making. 2010, 10: 56-10.1186/1472-6947-10-56.

    Article  Google Scholar 

  5. Last MJ: A Dictionary of Epidemiology. 2001, New York: Oxford University Press, 180-

    Google Scholar 

  6. Buchan I, Canoy D: Challenges in obesity epidemiology. Obes Rev. 2007, 8 (suppl 1): 1-11.

    Google Scholar 

  7. Hossain P, Kawar B, El Nahas M: Obesity and diabetes in the developing world – a growing challenge. New Engl J Med. 2007, 356 (3): 213-215. 10.1056/NEJMp068177.

    Article  Google Scholar 

  8. Duncan M, Griffith M, Rutter H, Goldacre JM: Certification of obesity as a cause of death in England 1979–2006. Eur J Public Health Advance Access. 2010, 20 (6): 671-675. 10.1093/eurpub/ckp230.

    Article  Google Scholar 

  9. World Health Organisation (WHO): Definition of Obesity, Risk Factors, Complications, Epidemiology. 2012, []

    Google Scholar 

  10. Ogden LC: The epidemiology of obesity. Gastroenterology. 2007, 132: 2087-10.1053/j.gastro.2007.03.052.

    Article  Google Scholar 

  11. Cohen MA, Hersh RW: A survey of current work in biomedical text mining. Brief Bioninform. 2005, 6 (1): 57-71. 10.1093/bib/6.1.57.

    Article  Google Scholar 

  12. Meystre MS, Savova KG, Kipper-Schuler CK, Hurdle FJ: Extracting information from textual documents in the electronic health record: a review of recent research. Methods Inf Med. 2008, 47 (Suppl 1): 128-144.

    Google Scholar 

  13. Aramaki E, Miura Y, Tonoike M, Ohkuma T, Masuichi H, Waki K, Ohe K: Extraction of ADE from clinical records. IMIA. 2010, doi:10,3233/978-1-60750-588-4-739

    Google Scholar 

  14. Chowdhury MF, Lavelli A: Disease Mention Recognition with Specific Features. Proceedings of the 2010 Workshop on BNLP, ACL. 2010, Uppsala, Sweden: Association for Computational Linguistics, 83-90.

    Google Scholar 

  15. Niu Y, Hirst G: Analysis of Semantic Classes in Medical Text for Q&A. Proc. ACL Workshop on Question Answering in Restricted Domains. 2004, 54-61.

    Google Scholar 

  16. Borlawsky T, Friedman C, Lussier AY: Generating Executable Knowledge for Evidence-based Medicine Using Natural Language and Semantic Processing. AMIA Annual Symposium. 2006, 56-60.

    Google Scholar 

  17. Chung YG, Coiera E: A Study of Structured Clinical Abstracts and the Semantic Classification of Sentences. Proceedings of the ACL workshop BioNLP. Association for Computational Linguistics. 2007, 121-128.

    Google Scholar 

  18. Demner-Fushman D, Lin J: Answering clinical questions with knowledge-based and statistical techniques. Comput Ling. 2007, 33 (1): 63-103. 10.1162/coli.2007.33.1.63.

    Article  Google Scholar 

  19. Fiszman M, Rosemblat G, Ahlers CB, Rindflesch TC: Identifying risk factors for metabolic syndrome in biomedical text. AMIA Annu Symp Proc. 2007, 2007: 249-253.

    Google Scholar 

  20. Xu R, Gatern Y, Superkar SK, Das KA, Altman BR, Garber MA: Extracting Subject Demographic Information from Abstracts of Randomized Clinical Trial Reports. Proc. 12th World Congress on Health (Medical) Informatics. 2007, 550-554.

    Google Scholar 

  21. Chen ES, Hirpcsak G, Xu H, Markatou M, Friedman C: Automated acquisition of disease-drug knowledge from biomedical and clinical documents: an initial study. J Am Med Infor Assoc. 2008, 15: 87-98.

    Article  Google Scholar 

  22. De Bruijn B, Carini S, Kiritchenko S, Martin J, Sim I: Automated Information Extraction of Key Trial Design Elements from Clinical Trial Reports. AMIA Annual Symposium. 2008, 141-145.

  23. Chung YG: Towards identifying intervention arms in RCTs: extracting coordinating constructions. J Biomed Inform. 2009, 42 (5): 790-800. 10.1016/j.jbi.2008.12.011.

  24. Gerner M, Nenadic G, Bergman CM: LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics. 2010, 11: 85. 10.1186/1471-2105-11-85.

  25. Aronson AR, Lang FM: An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010, 17 (3): 229-236.

  26. SPECIALIST Lexicon. 2014.

  27. Frantzi K, Ananiadou S: Automatic Recognition of Multi-Word Terms: the C/NC value method. Intern J Digital Libraries. 2000, 3 (2): 115-130. 10.1007/s007999900023.

  28. Nenadić G, Ananiadou S, McNaught J: Enhancing automatic term recognition through recognition of variation. Proceedings of COLING. 2004, Geneva, 604-610.

  29. Fox C: A Stop List for General Text. ACM SIGIR Forum, Volume 24, no. 1–2. 1989, New York, NY, USA: ACM, 19-21.

  30. Cohen WW: MinorThird: Methods for Identifying Names and Ontological Relations in Text using Heuristics for Inducing Regularities from Data. 2004.

  31. Ananiadou S, Kell DB, Tsujii J: Text mining and its potential applications in systems biology. Trends Biotechnol. 2006, 24 (12): 571-579. 10.1016/j.tibtech.2006.10.002.

  32. Kim JD, Tsujii J: Corpora and their annotations. Text Mining for Biology and Biomedicine. Edited by: Ananiadou S, McNaught J. 2006, Artech House, ISBN 1-5053-984-X

We would like to thank Katherine McAllister (Institute of Population Health, University of Manchester) for the annotation of the datasets. This work was partially supported by the UK Medical Research Council via a PhD grant to GK. GN and IB are partially supported by the Health e-Research Centre (HeRC) grant. GN acknowledges support from the Serbian Ministry of Education and Science (projects III44006; III47003).

This article has been published as part of the Semantic Mining of Languages in Biology and Medicine (SMLBM) thematic series of the Journal of Biomedical Semantics. An initial version of the article was presented at the 4th International Symposium on Languages in Biology and Medicine (LBM) in 2011.

Author information

Corresponding author

Correspondence to Goran Nenadic.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The study was conceived and designed by IB and GN. GK implemented the system, provided the data and performed the experiments. All authors read and approved the final manuscript.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Karystianis, G., Buchan, I. & Nenadic, G. Mining characteristics of epidemiological studies from Medline: a case study in obesity. J Biomed Semant 5, 22 (2014).
