Compared to other approaches that focused specifically on randomized clinical trials, our approach addresses a significantly more diverse literature space. We aimed at extracting key epidemiological characteristics, which are typically more complex than those presented in clinical trials. This is not surprising because clinical trials are subject to strict regulations and are reported in highly standardised ways. Although this makes it difficult to compare our results directly with those of others, we note that our precision (79-100%) is comparable to that reported in other studies (67-93%). The overall F-score of 87% suggests that a rule-based approach can generate reliable results in epidemiological text mining despite the restrained nature of the targeted concepts. Here we discuss several challenges and issues related to epidemiological text mining, and indicate areas for future work.
Complex and implicit expressions
Although the annotations were relatively reliable (recall the inter-annotator agreement of 80%), epidemiological abstracts feature a number of complex and implicit expressions, reported at varying levels of detail, that are challenging for text mining. For example, a population can be described in various ways: from reporting age, sex and geographical region to mentioning the disease that the individuals are currently affected by, or the conditions that exclude individuals from the study (e.g. “The study comprised of 52 subjects with histologically confirmed advanced colorectal polyps and 53 healthy controls” [PMID – 21235114]). Even more complex are the ways in which exposures are expressed, given that these are seldom stated explicitly as exposures and are instead part of the context of the study. Similarly, identification of covariate concepts is challenging as only a small number of covariates are explicitly stated in text.
Finally, our dictionary coverage and focus were quite limited by design: we focused on biomedical concepts, but other types of concepts may be studied as determinants and outcomes, or be mentioned as covariates (e.g., “high school environmental activity”). While these have been addressed by application of ATR, more generic vocabularies may need to be used (see below for some examples).
Error analysis on the evaluation dataset
Our approach is based on intensive lexical and terminological pre-processing and on rules to identify the key epidemiological characteristics. The number of rules designed for obesity can be considered relatively high (412), given that they were engineered from relatively small training (and development) datasets. On the one hand, the numbers of rules for study design (16), covariates (28) and effect size (15) were rather small in comparison to others, e.g. population (119), indicating the existence of generic expression patterns that can identify the more generic epidemiological characteristics (such as study design or effect size). On the other hand, disease-related concepts often include a variety of determinants along with a number of outcomes of various nature (e.g. anatomical, biological, disease-related, etc.). Recognizing these epidemiological elements (e.g., outcomes, exposures) through a rule-based approach is therefore not easy and requires a number of rules to accommodate different types of expression. We briefly discuss the error cases for each of the characteristics below.
Study design
Due to the limited number of study design mentions (only 13) in the evaluation set, the high values of precision, recall and F-score should be interpreted with caution. There were no false positives in the evaluation dataset. However, it is possible that in a larger dataset, false positives could appear if certain citations mention more than one study type. In addition, study designs without specific information can be ambiguous and thus were ignored (e.g. the false negative “Metabolic and bariatric surgery for obesity: a review”).
Population
An analysis of false positives reveals that rules relying on prepositional phrases associated with populations (e.g. “among” and “in”) need to require more specifically the presence of patient-related concepts. False negatives included “3,715 deliveries” and “895 veterans who had bariatric surgery”, which refer to births and a specific demographic respectively, but our lexical resources did not contain those. Nevertheless, the F-score for the population type was the second best (93%), showing that a rule-based approach can be used to identify the participants in epidemiological studies. An interesting issue arose in the identification of populations associated with meta-analyses. For example, the mention “included 3 studies involving 127 children” was identified by patterns, but it is clear that a specific approach would be needed for meta-analysis studies.
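The dependence of population rules on lexical coverage can be illustrated with a minimal sketch. It assumes a single hypothetical pattern shape (a count, up to two modifier words, and a head noun drawn from a participant lexicon); the names `PARTICIPANT_NOUNS`, `build_count_pattern` and `find_population_mentions` are illustrative and are not taken from the system described here.

```python
import re

# Hypothetical participant lexicon, for illustration only; the lexical
# resources actually used by the system are not reproduced here.
PARTICIPANT_NOUNS = {"subjects", "patients", "children", "controls", "adults"}


def build_count_pattern(nouns):
    """Compile '<count> [up to two modifiers] <head noun from the lexicon>'."""
    heads = "|".join(sorted(re.escape(n) for n in nouns))
    return re.compile(
        r"\b(?P<count>\d{1,3}(?:,\d{3})+|\d+)\s+"  # 52 or 3,715
        r"(?P<mods>(?:[a-z-]+\s+){0,2}?)"          # e.g. "healthy "
        r"(?P<head>" + heads + r")\b",
        re.IGNORECASE,
    )


def find_population_mentions(text, nouns=PARTICIPANT_NOUNS):
    """Return the matched population spans in order of appearance."""
    pattern = build_count_pattern(nouns)
    return [m.group(0) for m in pattern.finditer(text)]


sentence = ("The study comprised of 52 subjects with histologically confirmed "
            "advanced colorectal polyps and 53 healthy controls")
print(find_population_mentions(sentence))

# A head noun missing from the lexicon yields a false negative ...
print(find_population_mentions("3,715 deliveries"))
# ... which disappears once the lexicon is extended.
print(find_population_mentions("3,715 deliveries", PARTICIPANT_NOUNS | {"deliveries"}))
```

Under these assumptions, a missed mention such as “3,715 deliveries” is a lexicon gap rather than a pattern gap, which is consistent with the false negatives described above.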
Exposures and outcomes
While outcomes are often explicitly mentioned in text as such, exposure concepts are not, which makes the identification of exposures a particularly challenging task. Still, the use of dictionaries containing biomedical concepts for identification of potential mentions proved useful for capturing exposure concepts. However, dictionary-based look-up also contributed to incorrect exposure candidates that were extracted from non-relevant contexts. On the other hand, two frequent causes of errors could be linked to missing concepts from our dictionaries (e.g. “late bedtimes” or “costs”) and relatively complex exposure expressions (e.g. “level of PA during leisure”).
An important source of errors was the confusion between exposures and outcomes. Both draw on similar (semantic) types whose instances can, in different studies, be either an exposure or an outcome, so a studied determinant can easily be misinterpreted as an outcome (and vice versa). We noted that rules such as “association between < exposure > and < outcome>” or “<exposure > associated with < outcome>” generated encouraging results, i.e. a number of true positives. This was not surprising: when clinical professionals study the relationship between two concepts, they explore the link between an exposure and an outcome, which the above patterns capture. Still, these patterns sometimes matched links irrelevant to exposure/outcome relationships (e.g. “relationship between race and gender”). Cases like these generate both false positives and false negatives. Overall, a sentence-focused rule-based method may struggle to determine a concept’s role in a given case, and a wider context might need to be considered.
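The two pattern shapes quoted above can be sketched as regular expressions. This is a simplified illustration, not the rule set used in the system: the slot definitions are deliberately loose, and the names `ROLE_PATTERNS` and `assign_roles` are hypothetical.

```python
import re

# Loose, illustrative versions of the two quoted pattern shapes.
ROLE_PATTERNS = [
    # "association/relationship between <exposure> and <outcome>"
    re.compile(
        r"(?:association|relationship)\s+between\s+"
        r"(?P<exposure>.+?)\s+and\s+(?P<outcome>[^.,;]+)",
        re.IGNORECASE,
    ),
    # "<exposure> (was|were|is|are)? associated with <outcome>"
    re.compile(
        r"(?P<exposure>[^.,;]+?)\s+(?:(?:was|were|is|are)\s+)?"
        r"associated\s+with\s+(?P<outcome>[^.,;]+)",
        re.IGNORECASE,
    ),
]


def assign_roles(sentence):
    """Return (exposure, outcome) pairs proposed by any of the patterns."""
    pairs = []
    for pattern in ROLE_PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.append((m.group("exposure").strip(), m.group("outcome").strip()))
    return pairs


print(assign_roles("We examined the association between sleep duration and childhood obesity."))
print(assign_roles("Short sleep duration was associated with higher BMI."))
# The same surface pattern also fires on a link that is not an
# exposure/outcome relationship, producing a spurious pair.
print(assign_roles("We also report the relationship between race and gender."))
```

The last call shows why surface patterns alone cannot settle a concept's role: nothing in the sentence distinguishes “race and gender” from a genuine exposure/outcome pair, so information beyond the sentence would be needed to reject it.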
Covariates
Covariates had only a limited number of identified spans, hence any conclusion regarding the system’s performance is at most indicative. Still, the results could provide an initial indication that (at least explicit) covariate mentions could be detected with good accuracy, despite some false positives (e.g. a generic mention “potential confounders” was identified as a covariate in “… after adjustment for potential confounders”).
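One way such generic mentions could be suppressed is a stop-list applied after extraction. The sketch below assumes a single hypothetical “adjustment for …” pattern and an illustrative stop-list (`GENERIC_MENTIONS`); neither is taken from the system described here.

```python
import re

# Illustrative trigger for explicit covariate lists.
ADJUSTMENT = re.compile(
    r"(?:adjust(?:ment|ing|ed)|controlling)\s+for\s+(?P<items>[^.;)]+)",
    re.IGNORECASE,
)

# Hypothetical stop-list of generic phrases that name no specific covariate.
GENERIC_MENTIONS = {
    "potential confounders",
    "confounding factors",
    "confounders",
    "covariates",
}


def extract_covariates(sentence):
    """Return specific covariates listed after an adjustment trigger."""
    match = ADJUSTMENT.search(sentence)
    if match is None:
        return []
    # Split a coordinated list such as "age, sex and smoking status".
    items = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group("items"))
    cleaned = [item.strip() for item in items]
    return [c for c in cleaned if c and c.lower() not in GENERIC_MENTIONS]


print(extract_covariates("The association persisted after adjustment for age, sex and smoking status."))
print(extract_covariates("Results were similar after adjustment for potential confounders."))
```

A stop-list of this kind only removes known generic phrases; unseen variants would still pass through, so it is a partial remedy at best.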
Effect size
The rules designed to recognize effect size spans were based on the combination of numerical and specific lexical expressions (e.g. “relative risk”, “confidence interval”). The relatively high recall (87%) shows that this approach returned promising results: only a small number of mentions were missed by the system, while precision remained high. False negatives were typically expressions containing multiple values (e.g., “… increased risks of overweight/obesity at the age of 4 years (odds ratio (95% confidence interval): 15.01 (9.63, 23.38))”, “… bmi statistically significantly increased by 2.8% (95% confidence interval: 1.5% to 4.1%; p < 0.001) …”).
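The combination of a lexical cue with numerical values can be sketched as a single regular expression. This is a simplified, hypothetical rule (the names `EFFECT_SIZE` and `find_effect_sizes` are illustrative), not one of the 15 rules used in the system; it is included to show why a multi-value expression of the kind quoted above can fall outside such a pattern.

```python
import re

NUM = r"\d+(?:\.\d+)?"

# Lexical cue, then a value, then an optional 95% confidence interval.
EFFECT_SIZE = re.compile(
    rf"(?P<measure>odds ratio|relative risk|hazard ratio)"
    rf"\s*(?:=|:|of|was)?\s*(?P<value>{NUM})"
    rf"(?:\s*[,;(]\s*95%\s*(?:confidence interval|CI)\s*[:=]?\s*"
    rf"(?P<low>{NUM})\s*(?:-|to|,)\s*(?P<high>{NUM}))?",
    re.IGNORECASE,
)


def find_effect_sizes(text):
    """Return (measure, value, ci_low, ci_high) tuples; CI bounds may be None."""
    results = []
    for m in EFFECT_SIZE.finditer(text):
        low, high = m.group("low"), m.group("high")
        results.append((
            m.group("measure").lower(),
            float(m.group("value")),
            float(low) if low else None,
            float(high) if high else None,
        ))
    return results


print(find_effect_sizes("The relative risk was 1.45 (95% confidence interval 1.10-1.92)."))

# In the quoted false negative the cue is followed by a parenthesised
# interval label rather than a value, so this simple rule does not fire.
missed = ("increased risks of overweight/obesity at the age of 4 years "
          "(odds ratio (95% confidence interval): 15.01 (9.63, 23.38))")
print(find_effect_sizes(missed))
```

Covering such cases would require additional rules in which the measure and interval labels precede a group of values, at the cost of a larger rule set.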
Application to the obesity corpus
Although we had relatively good recall in both the development and evaluation datasets, the experiments with the entire obesity dataset have shown that the system extracted epidemiological information only from a limited number of documents. We have therefore explored the reasons for that.
Study design
We identified study type from only around 40% of processed articles (each tagged as obesity/epidemiology). To explore whether those missed study design mentions are due to our incomplete dictionaries and rules, we inspected 20 randomly selected articles from those that contained no identified study type, and we identified the following possible reasons:
- No mention of study design: while the article presents an epidemiological context, no specific epidemiological study had been conducted (and thus there was no need to specify study design) – this was the case in almost 2/3 of the abstracts with no study design;
- Summarised epidemiological studies: articles summarizing epidemiological information but without reporting a specific conducted study and its findings (15% of the abstracts);
- Other study designs: studies including comparative studies, surveys, pilot studies, follow-up studies, reports, reviews that were not targeted for identification (20% of the abstracts).
We note that we can see a similar pattern in the evaluation dataset (which was randomly selected from the obesity corpus). Importantly, for the majority of abstracts in the evaluation dataset, if the system was able to detect the study type, all other epidemiological characteristics have been extracted with relative success, providing a complete profile of an epidemiological study (data not shown).
Covariates
Only 5,500 confounding factors were recognised. To explore the reason for so many articles not having covariates extracted, a random sample of 20 abstracts in which no covariate concept was identified was investigated. None of the studied abstracts contained any covariate mentions. Most abstracts used only generic expressions (e.g., “after adjustment for confounding factors”, “after controlling for covariates”) without specifying the respective concepts. We note that we only processed abstracts and it seems likely that covariates may be defined in full-text articles.
Effect size
Observations similar to those made for covariates were noted for the effect size mentions (only 9,701 mentions were extracted). We explored a sample of 20 abstracts in which no effect size was recognised. As many as 60% of the abstracts did not report any observed effect size between the studied exposures and outcomes due to the nature of the conducted study (e.g. pilot study, systematic review, article). We failed, however, to extract effect size mentions in the remaining 40% of cases, mainly because the mentions contained coordinated expressions (e.g. “The prevalence of hypertension was considerably higher among men than among women (60.3% and 44.6%, respectively)”; PMID 18791341) or statistical significance data, which are not covered by our rules.
Outcomes
As opposed to other characteristics, the number of recognised outcome concepts was more than double the number of abstracts. This is not a surprise, as most of the epidemiological studies include more than one outcome of interest. In addition, with the current system, we have not attempted to unify synonymous terms (unless they are simple orthographic variants).