Ambiguity and variability of database and software names in bioinformatics
© Duck et al. 2015
Received: 8 July 2013
Accepted: 5 June 2015
Published: 29 June 2015
Bioinformatics offers numerous databases and software tools for accomplishing any given task, but until recently there were no methods that could systematically identify mentions of these resources within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.
Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
Keywords: Bioinformatics, Computational biology, CRF, Dictionary, Resource extraction, Text-mining
Bioinformatics and computational biology rely on domain databases and software to support data collection, aggregation and analysis, and the use of these resources is typically reported in research papers, usually as part of the methods section. However, limited progress has been made in systematically capturing mentions of databases and tools in order to explore computational practice in bioinformatics on a large scale. An evaluation of the resources in use could help bioinformaticians to identify common usage patterns [1] and potentially infer scientific “best practice” [2] based on a measure of how often, or where, a particular resource is used within an in silico workflow [3]. Although there are several inventories that list available database and software resources (e.g., the NAR databases and web-services special issues [4, 5], ExPASy [6], the Online Bioinformatics Resources Collection [7], etc.), until recently, to the best of our knowledge, there were no attempts to systematically identify resource mentions in the literature [8].
In recent years, biomedical text mining has been widely used to identify mentions of entities of different types in the literature. Named entity recognition (NER) enables automated literature insights [9] and provides input to other text-mining applications. For example, within biology and bioinformatics, NER systems have been developed to capture species [10], proteins/genes [11–13], chemicals [14], etc. Naming inconsistencies, numerous synonyms and acronyms, the difficulty of distinguishing entity names from common words in natural language, and ambiguous definitions of concepts all make NER a difficult task [15, 16]. Still, for some applications, NER tools achieve relatively high precision and recall. For example, LINNAEUS achieved F-scores around the 95 % mark for species name recognition and disambiguation at the mention and document levels [10]. Gene names, on the other hand, are known for their ambiguity and variability, resulting in lower reported F-scores: ABNER [12] recorded an F-score of just under 73 % for strict-match gene name recognition (85 % with some boundary-error toleration), and GNAT [13] reported an F-score of 81 % for the same task (up to a maximum of 90 % for single-species gene name recognition, e.g., for yeast).
Some previous work exists on the automated identification and harvesting of bioinformatics database and software names from the literature. For example, OReFiL [17] uses mentions of Uniform Resource Locators (URLs) in text to recognise new resources and update its own internal index. Similarly, BIRI (BioInformatics Resource Inventory) uses a series of hand-crafted regular expressions to automatically capture resource names, their functionality and their classification from paper titles and abstracts [18]. The reported quality of the identification process was in line with other NER tasks: BIRI successfully extracted resource names in 94 % of cases in a test corpus consisting of 392 abstracts that matched a search for “bioinformatics resource” and eight documents manually included to test domain robustness. However, both of these tools focus on updates and bias their evaluation towards resource-rich text, which prevents a full understanding of false-negative errors in the general bioinformatics literature.
This paper analyses database and software name mentions in the bioinformatics and computational biology literature to assess the challenges for automated extraction. Building on our previous work [19], we analyse the variability and ambiguity of resource names using a set of 60 full-text documents manually annotated at the mention level, and compare dictionary and machine learning approaches to their identification based on results from an additional dataset of 25 full-text documents. Although we focus here on bioinformatics resources, the challenges and solutions encountered in database and software name recognition are generic, and thus not unique to this domain [20].
Corpus annotation and analysis
For the purpose of this study, we define a database as any electronic resource that stores records in a structured form and provides unique identifiers for each record. This includes any database, ontology, repository or classification resource. Examples include SCOP (a database of protein structural classification) [21], UniProt (a database of protein sequences and functional information) [22], the Gene Ontology (an ontology that describes gene product attributes) [23], PubMed (a repository of abstracts) [24], etc. We adopt Wikipedia’s definition of software [25]: “a collection of computer programs … that provides the instructions for telling a computer what to do and how to do it”, and use program and tool as synonyms for software. Examples include BLAST (automated sequence comparison) [26], eUtils (access to literature data) [27], etc. We also include mentions of web-services and package names (e.g., R packages from Bioconductor [28, 29]). We explicitly exclude database record numbers/identifiers (e.g., GO:0002474, Q8HWB0), file formats (e.g., PDF), programming languages and their libraries (e.g., Python, BioPython), operating systems (e.g., Linux), algorithms (e.g., Merge-Sort), methods (e.g., ANOVA, Random Forests) and approaches (e.g., machine learning, dynamic programming).
To explore the use of database and tool names, we developed an annotated set of 60 full-text articles from the PubMed Central open-access subset [30]. The articles were randomly selected from Genome Biology (5 articles), BMC Bioinformatics (36) and PLoS Computational Biology (19). These journals were selected because they provide a broad overview of the bioinformatics and computational biology domains.
The articles were primarily annotated by a bioinformatician (GD) with experience in text mining. The annotation process involved marking each database/software name mention; associated designators (e.g., words such as database or software) were included only if part of the official name (e.g., Gene Ontology). Inter-annotator agreement (IAA) [31] for the annotation of database and software names was calculated on five full-text articles randomly selected from the annotated corpus, which were independently annotated by a PhD student with a bioinformatics and text-mining background.
To assess the complexity, composition, variability and ambiguity of resource names, we analysed the annotated mentions. The corpus was pre-processed using a typical text-mining pipeline consisting of the tokeniser, sentence splitter and part-of-speech (POS) tagger from GATE’s ANNIE [32]. We analysed the length of names, their lexical (stemmed, token-level) and structural (POS-tag pattern) composition, and their variability and ambiguity compared with common English words, acronyms and abbreviations.
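As a rough illustration of this pre-processing stage, the sketch below assembles a comparable pipeline in Python, using NLTK as a stand-in for GATE’s ANNIE components (an assumption for illustration only; the study itself used ANNIE):

```python
# Minimal stand-in for the ANNIE-style pre-processing pipeline: sentence
# splitting, tokenisation, POS tagging and stemming. NLTK is used here purely
# for illustration (requires the "punkt" and "averaged_perceptron_tagger"
# resources, e.g., via nltk.download()).
import nltk
from nltk.stem import PorterStemmer

def preprocess(text):
    stemmer = PorterStemmer()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # One (token, POS tag, stem) triple per token.
        yield [(tok, pos, stemmer.stem(tok)) for tok, pos in tagged]

for sent in preprocess("We aligned all reads with BLAST. Hits were stored in UniProt."):
    print(sent)
```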
In addition to the dataset of 60 articles used for the analysis and development of the NER tools, a further 25 full-text papers were annotated to assess the quality of the proposed NER approaches (see below).
Dictionary-based approach (baseline)
Table 1 Sources from which the database and software name dictionary was compiled, including manually added entries, yielding our combined dictionary (DB, SW, PK)
Machine learning approach
Given the availability of the manually annotated corpus, a machine learning (ML) approach was explored for the identification of resource names. We approached the task as a sequence-tagging problem, as often adopted in NER systems, opting for Conditional Random Fields (CRF) [33] with token-level features that comprised each token’s own characteristics and those of its neighbouring tokens. We used the Beginning-Inside-Outside (B-I-O) annotation scheme.
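To make the B-I-O scheme concrete, the sketch below converts annotated character spans into token-level labels; the function and the example spans are our own illustration:

```python
# Hypothetical sketch: convert annotated character spans to B-I-O token labels.
def bio_labels(tokens, resource_spans):
    """tokens: [(text, start, end)]; resource_spans: [(start, end)] of annotated names."""
    labels = []
    for text, start, end in tokens:
        label = "O"
        for s, e in resource_spans:
            if start == s:                 # token opens an annotated name
                label = "B"
            elif s < start and end <= e:   # token continues inside the name
                label = "I"
        labels.append(label)
    return labels

# "Gene Ontology" annotated as one resource name:
tokens = [("We", 0, 2), ("used", 3, 7), ("Gene", 8, 12), ("Ontology", 13, 21)]
print(bio_labels(tokens, [(8, 21)]))   # -> ['O', 'O', 'B', 'I']
```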
1. Orthographic features captured the orthographic patterns associated with biomedical resource mentions. For example, a large percentage of mentions are acronyms (e.g., GO, SCOP), capitalised terms (e.g., Gene Ontology, Bioconductor) or words that combine capital and lower-case letters (e.g., MySQL, UniProt). We engineered two groups of orthographic features [34]. The first group comprised shape (pattern) features that mapped a given token to an abstract representation: each capital letter was replaced with “X”, each lower-case letter with “x”, each digit with “d” and any other character with “S”. Two features were created in this group: the first contained a mapping for each character in a token (e.g., MySQL was mapped to “XxXXX”); the second mapped a token to a four-character string indicating the presence of a capital letter, a lower-case letter, a digit or any other character (absence was marked with “_”), e.g., MySQL was mapped to “Xx__”. The features in the second group captured specific orthographic characteristics (e.g., is the token capitalised, does it consist only of capital letters, does it contain digits – see Table 2 for the full list), extracted by a set of regular expressions. A sketch of these mappings appears after this feature list.
Table 2 Token-specific orthographic features extracted by regular expressions:
- token is an acronym
- all the letters in the token are capitalised
- token is capitalised
- token contains at least one capital letter
- token contains at least one digit
- token is made up of digits only
2. Dictionary features were represented by a single binary feature that indicated whether the given token was contained within our biomedical resources dictionary.
3. Lexical features included the token itself, its lemma and its part-of-speech (POS) tag.
4. Syntactic features were extracted from the syntactic relations in which the phrase participated as a governor or a dependant, as returned by the Stanford parser [35, 36]; where several relations applied, the relation types were sorted alphabetically and concatenated (e.g., “pobj” and “advmod” were combined as “advmod_pobj”).
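The shape and presence-string mappings, together with a few of the Table 2 flags and the binary dictionary feature, can be reproduced as follows; the function names are ours and the regular expressions are illustrative rather than the exact ones used in the study:

```python
import re

def shape(token):
    # Per-character shape: capital -> "X", lower-case -> "x", digit -> "d", other -> "S".
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else "S" for c in token)

def char_classes(token):
    # Four-character presence string ("_" marks absence of a character class).
    return ("X" if any(c.isupper() for c in token) else "_") + \
           ("x" if any(c.islower() for c in token) else "_") + \
           ("d" if any(c.isdigit() for c in token) else "_") + \
           ("S" if any(not c.isalnum() for c in token) else "_")

def orthographic_flags(token, dictionary=frozenset()):
    # A few of the Table 2 features plus the binary dictionary feature.
    return {
        "all_caps":      bool(re.fullmatch(r"[A-Z]+", token)),
        "capitalised":   bool(re.fullmatch(r"[A-Z].*", token)),
        "has_digit":     bool(re.search(r"[0-9]", token)),
        "all_digits":    bool(re.fullmatch(r"[0-9]+", token)),
        "in_dictionary": token in dictionary,
    }

print(shape("MySQL"), char_classes("MySQL"))   # -> XxXXX Xx__
```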
Experiments on the training data revealed that a context window of two tokens before and one token after the current token provided the best performance. The CRF model was trained using CRF++ [37]. All pre-processing needed for feature extraction was provided by the same text-mining pipeline used for the corpus analysis and the dictionary-based approach.
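For concreteness, a CRF++ feature template expressing this window might look as follows; the column layout (with the token in column 0) and the exact feature combinations are our assumptions, not the template actually used:

```python
# Hypothetical CRF++ unigram template over a window of two tokens before and
# one token after the current token; column 0 of the training file is assumed
# to hold the token (other feature columns would be referenced analogously).
template = """\
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
B
"""
with open("template", "w") as f:
    f.write(template)

# Training and tagging would then use the standard CRF++ binaries:
#   crf_learn template train.data model
#   crf_test -m model test.data
```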
Machine learning – post-processing
An analysis of the initial CRF results on the development dataset revealed that a large portion of false negatives were resource mentions that the model recognised at least once in a document but missed elsewhere within the same document. We therefore designed a two-pass post-processing approach: the first pass collected and stored all the CRF tagging results, which were then used to re-label the tokens in the second pass. To avoid over-generation of labels (i.e., possible false positives), we created a set of conditions that each token had to meet to be re-labelled as a resource mention. First, across the entire corpus being tagged, it had to be labelled as (part of) a resource name in the first pass more often than not. If so, the candidate token also had to fulfil one of two further conditions: either it was contained within the biomedical resources dictionary, or it was an acronym at least two characters long containing no digits. Finally, the four tokens “analysis”, “genomes”, “cycle” and “cell” were never re-labelled as part of a resource name in the second pass, as they were found to be the source of a large percentage of false positives.
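A minimal sketch of this two-pass re-labelling follows; the B-I-O handling is simplified to single-token re-labelling, and the acronym test is our own approximation of the condition described above:

```python
from collections import Counter

STOP = {"analysis", "genomes", "cycle", "cell"}   # never re-labelled (see above)

def second_pass(tagged_corpus, dictionary):
    """tagged_corpus: [(token, label)] from the first CRF pass, label in {B, I, O}.
    Returns the corpus with eligible O-tokens re-labelled (a simplification:
    the real system re-labels mentions, not just single tokens)."""
    pos, total = Counter(), Counter()
    for token, label in tagged_corpus:
        total[token] += 1
        if label != "O":
            pos[token] += 1

    def eligible(token):
        if token in STOP:
            return False
        # Labelled as (part of) a resource name more often than not?
        if pos[token] <= total[token] - pos[token]:
            return False
        # In the dictionary, or a digit-free acronym of length >= 2.
        return token in dictionary or (token.isalpha() and token.isupper()
                                       and len(token) >= 2)

    return [(tok, "B" if lab == "O" and eligible(tok) else lab)
            for tok, lab in tagged_corpus]
```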
Standard text-mining performance statistics (precision, recall, F-score) were used for evaluation. In particular, we used 5-fold cross-validation across all 60 full-text articles for both the dictionary and machine learning approaches; for a fair comparison, the dictionary-based approach was evaluated only on the test partition of each fold, as it requires no prior “training”. We also tested both approaches directly on the held-out set of 25 articles without additional training or adjustment.
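For reference, the sketch below shows one way strict and lenient span matching can be scored; the lenient criterion here (any same-document overlap) is our simplification of boundary-error toleration:

```python
def overlaps(a, b):
    # Spans are (doc_id, start, end); they overlap if in the same document
    # and their character ranges intersect.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def prf(predicted, gold, lenient=False):
    if lenient:
        tp_p = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
        tp_g = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    else:
        tp_p = tp_g = len(set(predicted) & set(gold))
    p = tp_p / len(predicted) if predicted else 0.0
    r = tp_g / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```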
Results and discussion
Table 3 Statistics describing the manually annotated corpora: total number of documents; total database and software mentions; total unique resource mentions; percentage of database mentions (total and unique); average and maximum mentions per document (total and unique); and the number of resources with only a single lexicographic mention
In the development corpus, there were 401 lexically unique resources mentioned 2416 times (6 mentions on average per unique resource name), with an average of 40 resource mentions per document. The document with the most mentions had 227 resource mentions within it. Finally, 50 % of resource names were only mentioned once in the corpus. A similar profile was noted for the test corpus, although it contained notably more resource mentions per document.
Database and software name composition
We first analysed the composition of resource names both in the development corpus and dictionary. The longest database/software name in the annotated corpus contained ten tokens (i.e., Search Tool for the Retrieval of Interacting Genes/Proteins). However, there are longer examples in the dictionary (e.g., Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences).
Table 4 Internal POS structure of database and software names in the development corpus (example patterns include “NNP NNP NNP” and “NNP NNP NNP NNP”)
Variability of resource names
To evaluate the variability of resource names within our dictionary, we calculated the average number of name variants per resource: 1.13 at the dictionary level (6929 unique variants over 6126 resources, after adjustment for repeats). For the corpus analysis, we manually grouped the annotated mentions that referred to the same resource. Specifically, we grouped variants based on spelling errors and orthographic differences, and then paired long and short (acronym) forms based on our own background knowledge and the text from which they were extracted. Of the 401 lexically unique names, 97 were variants of other names, leaving 304 unique resources. In total, 231 resources (76 %) had only a single name variant within the corpus; 18 % had two variants, and the remaining 6 % had between three and five variants. Of the 97 name variants, 36 were acronyms, most of which were defined in the text (and so could perhaps be expanded automatically with available tools, e.g., [39]). However, there were other cases where a resource’s acronym was used without the expanded form being given (e.g., BLAST).
Ambiguity of resource names
As expected, a number of ambiguous resource names exist within the bioinformatics domain. Interesting examples include Network [40] (a tool enabling network inference from various biological datasets) and analysis [41] (a package for DNA sequence analysis). We therefore analysed our dictionary of database and software names to evaluate dictionary-level ambiguity against a full English-word dictionary derived from a publicly available list [42] (hereafter the “English dictionary”) and a biomedical acronym dictionary compiled from ADAM [43] (hereafter the “acronym dictionary”), consisting of 86,308 and 1933 terms, respectively. Using case-sensitive matching, 52 names matched English words (e.g., analysis, cycle, graph) and 77 names fully matched known acronyms (e.g., DIP: distal interphalangeal, but also the Database of Interacting Proteins). With case-insensitive matching, the number of matches rises to 534 against the English dictionary and 96 against the acronym dictionary.
To evaluate the recognition-level ambiguity within the annotated corpus, we also compared the annotated database and software names to the English dictionary and acronym dictionary. This resulted in four matches to the English dictionary (ACT, blast, dot, R), and six to the acronym dictionary (BBB, CMAP, DIP, IPA, MAS, VOCs) using case-sensitive matching. This equates to roughly 3 % of the unique annotated names. The total increases to 53 matches (17 %) if case-insensitive matching is used instead.
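The effect of case sensitivity reported above can be reproduced with a simple set comparison; the word lists in this sketch are tiny stand-ins for the full English and acronym dictionaries:

```python
def ambiguous(names, reference, case_sensitive=True):
    """Return resource names that collide with a reference vocabulary
    (e.g., an English word list or an acronym dictionary)."""
    if case_sensitive:
        return set(names) & set(reference)
    ref = {w.lower() for w in reference}
    return {n for n in names if n.lower() in ref}

english = {"analysis", "cycle", "graph", "blast", "network"}
resources = {"analysis", "BLAST", "Network", "UniProt"}
print(ambiguous(resources, english))                        # -> {'analysis'}
print(ambiguous(resources, english, case_sensitive=False))  # -> {'analysis', 'BLAST', 'Network'}
```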
Evaluation results on the development and test corpora
Table 5 Precision, recall and F-score on the development and test corpora for the dictionary baseline and for the CRF model with and without post-processing
Dictionary matching results on the development set
Machine learning results with post-processing on the development set
Machine learning results without post-processing on the development set
Combined dictionary and machine learning results on the development set
Feature impact analysis for the ML model
Table 6 Feature impact analysis of the machine learning model (without post-processing) on the development set, removing each feature group in turn: lexical, syntactic, orthographic and dictionary features
Overall, the lexical features were beneficial: removing this group caused a drop of 8 % in precision and 6 % in recall, resulting in a 7 % lower F-score. The syntactic features had only a slight impact: removing them cost 1 % in both precision and recall and 2 % in F-score. The orthographic features had a similar effect to the lexical features: removing them lost 8 % in precision and 6 % in recall, for a 7 % loss in F-score. Surprisingly, removing the dictionary features did not greatly degrade performance (an 8 % drop in precision, a 5 % drop in recall and thus a 6 % drop in F-score), suggesting that the ML model, even without the aid of a dictionary and with a relatively limited amount of training data, managed to capture a significant number of resource mentions.
Missed database and software mentions
Table 7 Types of textual patterns and clues for the identification of database and software names, and their contribution to total true positives (TPs): machine learning matches, heads and Hearst patterns [44], and references and URLs
Example clues and phrases appearing with specific heads or in Hearst patterns:
- “… the stochastic simulator Dizzy allows …”
- “The MethMarker software was …”
- “… tools: CLUSTALW, …, and MUSCLE.”
- “… programs such as Simlink, …, and SimPed.”
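Designator (“head”) words such as “software” or “simulator” next to a capitalised token, and Hearst-style enumerations [44] after plural designators, lend themselves to simple surface patterns. The sketch below is our own illustrative approximation, not the pattern set analysed in the paper:

```python
import re

# Head-noun clue: a designator word adjacent to a capitalised candidate name,
# e.g., "the stochastic simulator Dizzy", "The MethMarker software was ...".
HEAD_AFTER  = re.compile(r"\b(?:software|tool|program|database|simulator)\s+([A-Z][\w./+-]*)")
HEAD_BEFORE = re.compile(r"\b([A-Z][\w./+-]*)\s+(?:software|tool|program|database)\b")

# Hearst-style enumeration: "programs such as X, Y, and Z".
HEARST = re.compile(r"\b(?:tools|programs|databases)\s+such\s+as\s+([^.;]+)")

text = "We tested programs such as Simlink and SimPed."
m = HEARST.search(text)
if m:
    names = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(1))
    print([n.strip() for n in names])   # -> ['Simlink', 'SimPed']
```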
Example phrases from title appearances:
- “CoXpress: differential co-expression in gene expression data”
- “TABASCO: A single molecule, base-pair resolved gene expression simulator”
- “SimHap GUI: An intuitive graphical user interface for genetic association analysis”
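Title appearances of the form “NAME: description” are similarly regular; a hypothetical matcher:

```python
import re

# Title clue: "NAME: description", optionally with a second capitalised token
# (e.g., "SimHap GUI: ..."). Illustrative only.
TITLE = re.compile(r"^([A-Z][\w./+-]*(?:\s+[A-Z][\w./+-]*)?):\s")

for title in ["CoXpress: differential co-expression in gene expression data",
              "SimHap GUI: An intuitive graphical user interface for genetic association analysis"]:
    m = TITLE.match(title)
    if m:
        print(m.group(1))   # -> CoXpress, then SimHap GUI
```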
Another clue is that database and software mentions are frequently followed by either a reference or a web URL (e.g., “Galaxy and EpiGRAPH”, each followed by a citation; PMC2784320). This was the main indicator used by OReFiL [17]. We recognise, however, that web URLs and citations are not used only for resources, so this clue is far less reliable than the previous ones (for example, it would incorrectly capture “The learning metrics principle [14, 15]”; PMC272927). Restricting this clue to a paper’s Methods section may reduce the potential impact on precision.
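A sketch of this clue as a pattern (the example sentence and citation numbers are invented):

```python
import re

# Citation/URL clue: a capitalised name immediately followed by a bracketed
# citation or a parenthesised URL. As noted above, this over-generates
# (ordinary cited phrases match too), so it should be combined with other clues.
CITED = re.compile(r"\b([A-Z][\w./+-]*)\s*(?:\[\d+(?:,\s*\d+)*\]|\(https?://\S+\))")

s = "We used Galaxy [1] and EpiGRAPH [2] to pre-process the data."
print(CITED.findall(s))   # -> ['Galaxy', 'EpiGRAPH']
```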
Example versioning clues:
- “… using dot v1.10 and Graphviz 1.13(v16).”
- “CLUSTAL W version 1.83”
- “Dynalign 4.5, and LocARNA 0.99”
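Version expressions can likewise be approximated with a single pattern; this sketch is our own and is deliberately limited to capitalised names, so lower-case names such as “dot” would be missed:

```python
import re

# Versioning clue: a capitalised name followed by "v1.10", "version 1.83" or a
# bare dotted version number.
VERSION = re.compile(r"\b([A-Z][\w .+-]{0,30}?)\s*,?\s*(?:v(?:ersion)?\s*)?(\d+(?:\.\d+)+)")

for s in ["CLUSTAL W version 1.83", "Dynalign 4.5, and LocARNA 0.99"]:
    print(VERSION.findall(s))
# -> [('CLUSTAL W', '1.83')]
# -> [('Dynalign', '4.5'), ('LocARNA', '0.99')]
```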
Example expressions that functionally indicate database and software mentions:
- “… the SimHap GUI installation.”
- “… implemented within PedPhase …”
- “MethMarker therefore provides …”
- “A typical screenshot of MethMarker …”
- “Cofolga2 has six free parameters …”
- “MethMarker’s user interface reflects …”
- “MethMarker can directly import …”
- “xPedPhase thus needs cubic time …”
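Such functional contexts can be encoded as trigger patterns around a capitalised candidate name; an illustrative sketch with a deliberately small cue list:

```python
import re

NAME = r"([A-Z][\w./+-]*)"
# Functional-context cues (a small, illustrative subset).
CUES = [
    re.compile(r"\bimplemented\s+(?:within|in)\s+" + NAME),
    re.compile(r"\bscreenshot\s+of\s+" + NAME),
    re.compile(r"\bthe\s+" + NAME + r"\s+installation\b"),
    re.compile(NAME + r"\s+(?:provides|can\s+directly\s+import)\b"),
]

sentence = "A typical screenshot of MethMarker shows the configuration dialog."
for cue in CUES:
    m = cue.search(sentence)
    if m:
        print(m.group(1))   # -> MethMarker
```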
Examples of comparisons between database and software names:
- “… the numbers of breakpoint sites by xPedPhase were equal to the numbers of breakpoints by i Linker …”
- “xPedPhase did better than i Linker …”
- “Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz …”
- “Foldalign was much slower than Cofolga2 except for …”
- “Like Moleculizer, Tabasco dynamically generates …”
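Comparative constructions are useful because they yield two candidate names at once; an illustrative sketch:

```python
import re

NAME = r"([A-Z][\w./+-]*)"
# Comparative constructions pairing two candidate names, e.g.,
# "X did better than Y", "X was much slower than Y", "Like X, Y ...".
COMPARE = [
    re.compile(NAME + r"\s+did\s+better\s+than\s+" + NAME),
    re.compile(NAME + r"\s+was\s+(?:much\s+)?(?:slower|faster)\s+than\s+" + NAME),
    re.compile(r"\bLike\s+" + NAME + r",\s+" + NAME),
]

s = "Foldalign was much slower than Cofolga2 except for short sequences."
for pat in COMPARE:
    m = pat.search(s)
    if m:
        print(m.groups())   # -> ('Foldalign', 'Cofolga2')
```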
Example phrases with no clear or discriminative clues:
- “Additionally, i Linker has an error correction step that detects unlikely crossover events.”
- “In addition, Tabasco should be a good base to further study interactions on DNA …”
- “PSPE is not only able to use one of many common models of nucleotide substitution …”
- “The results show that LibSELDI tends to have a considerable advantage in the low FDR region …”
- “The structure of Tabasco confers at least four advantages.”
False positive filtering
Typical false positives returned by the CRF models include mentions of programming languages and their libraries (e.g., Python, BioPython), algorithms/methods (e.g., Euclidean – a distance measure; BLOSUM – a similarity scoring matrix), file formats (e.g., FASTA), and companies and organisations (e.g., EBI – the European Bioinformatics Institute). While we explicitly excluded these types from the current task, they can still be useful indicators of bioinformatics practice. Another large class of errors, as with the dictionary approach alone, involves matches of the GO sub-string within database identifiers (e.g., GO:0007089). Finally, ambiguous acronyms are typically returned as errors, but could be checked by searching for a definition within the document.
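Both of the latter error classes lend themselves to simple post-hoc filters; the sketch below illustrates the identifier check and the acronym-definition check (the length threshold and patterns are our assumptions):

```python
import re

def keep_mention(mention, following_text, document_text):
    # Drop matches like "GO" inside a record identifier such as "GO:0007089".
    if re.match(r":\d", following_text):
        return False
    # For short, ambiguous acronyms, require a parenthesised definition
    # somewhere in the document, e.g., "Database of Interacting Proteins (DIP)".
    if mention.isalpha() and mention.isupper() and len(mention) <= 4:
        return bool(re.search(r"\(\s*" + re.escape(mention) + r"\s*\)", document_text))
    return True

doc = "... the Database of Interacting Proteins (DIP) was queried; see GO:0007089 ..."
print(keep_mention("GO", ":0007089 ...", doc))    # -> False
print(keep_mention("DIP", " was queried", doc))   # -> True
```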
We note that there is not always a clear distinction between database and software names, methods, approaches, algorithms, programming languages, database records/identifiers, and file formats. We have chosen to focus on “executables” and datasets because our ultimate aim is to help reconstruct the bioinformatics workflow used within a given paper, so as to support experiment replication and reproduction. The problem arises because authors often introduce a novel algorithm together with an associated implementation (e.g., as a service or a stand-alone application), but frequently refer to their contribution only as an algorithm (or method) rather than as software, or vice versa. Although they discuss their algorithm throughout the paper, it could be argued that they are referring to their software implementation, especially when reporting benchmark improvements. This fuzzy boundary is a challenge for any focused automated system to overcome, although the distinction may not be relevant for some applications.
In this paper we presented an exploration of the variability and ambiguity of database and software mentions in the bioinformatics and computational biology literature. Our results suggest that database and software NER is a non-trivial task that requires more than a dictionary-matching approach, even when comprehensive resource inventories are used. Given bioinformatics’ focus on resource creation, no dictionary will ever be sufficiently comprehensive, making resource recognition potentially as hard as gene name recognition (in contrast to species recognition, a relatively stable domain). Names such as Network and analysis are sources of ambiguity, whereas acronyms and verbalised references to software such as “BLASTed” introduce variability that needs to be overcome.
The results of our ML-model show that dictionary-based predictions can be significantly improved. While ML achieved a major increase in precision, boosting recall proved to be challenging, indicating that additional attributes need to be included for accurate biomedical resource recognition.
Our analyses also identified a series of clues that could be picked up by text-mining techniques. As many of these clues are ambiguous on their own, one approach would be to combine several sources of evidence (e.g., via voting or thresholds) to capture database and software names more accurately (see, for example, [34]). Further work could combine such rules with the machine learning system to increase overall accuracy, perhaps recovering some of the lost recall.
Availability of supporting data
The datasets supporting the results of this article are available at: http://sourceforge.net/projects/bionerds/.
Abbreviations
BLAST: Basic local alignment search tool
CRF: Conditional random fields
NER: Named entity recognition
PDF: Portable document format
SCOP: Structural classification of proteins
Acknowledgements
We would like to thank Daniel Jamieson (University of Manchester) for his help in establishing the inter-annotator agreement. GD is funded by a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC) to RS, GN and DLR. The work of AK and GN is partially funded by projects III44006 (GN) and III47003 (AK and GN) of the Serbian Ministry of Education and Science. We also thank the authors of the sites listed in Table 1 for freely providing inventories of database and tool names.
The first version of this manuscript was presented at the Semantic Mining in Biology and Medicine (SMBM) 2012 symposium [19].
References
1. Duck G, Nenadic G, Brass A, Robertson DL, Stevens R. Extracting patterns of database and software usage from the bioinformatics literature. Bioinformatics. 2014;30:i601–8.
2. Eales JM, Pinney JW, Stevens RD, Robertson DL. Methodology capture: discriminating between the “best” and the rest of community practice. BMC Bioinformatics. 2008;9:359.
3. Stevens R, Glover K, Greenhalgh C, Jennings C, Pearce S, Li P, et al. Performing in silico experiments on the grid: a users perspective. In: Proc UK e-Science Program All Hands Meet; 2003. p. 43–50.
4. Brazas MD, Yim DS, Yamada JT, Ouellette BFF. The 2011 bioinformatics links directory update: more resources, tools and databases and features to empower the bioinformatics community. Nucleic Acids Res. 2011;39 Suppl 2:W3–7.
5. Galperin MY, Cochrane GR. The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2011;39(Database issue):D1–6.
6. ExPASy: SIB Bioinformatics Resource Portal. http://expasy.org/
7. Chen Y-B, Chattopadhyay A, Bergen P, Gadd C, Tannery N. The Online Bioinformatics Resources Collection at the University of Pittsburgh Health Sciences Library System – a one-stop gateway to online bioinformatics databases and software tools. Nucleic Acids Res. 2007;35(Database issue):D780–5.
8. Duck G, Nenadic G, Brass A, Robertson DL, Stevens R. bioNerDS: exploring bioinformatics’ database and software use through literature mining. BMC Bioinformatics. 2013;14:194.
9. Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB. Frontiers of biomedical text mining: current progress. Brief Bioinform. 2007;8:358–75.
10. Gerner M, Nenadic G, Bergman CM. LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics. 2010;11:85.
11. Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics. 2005;6 Suppl 1:S1.
12. Settles B. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191–2.
13. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G. Inter-species normalization of gene mentions with GNAT. Bioinformatics. 2008;24:i126–32.
14. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S. Using workflows to explore and optimise named entity recognition for chemistry. PLoS One. 2011;6:e20181.
15. Dingare S, Nissim M, Finkel J, Manning C, Grover C. A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations. Comp Funct Genomics. 2005;6:77–85.
16. Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform. 2005;6:357–69.
17. Yamamoto Y, Takagi T. OReFiL: an online resource finder for life sciences. BMC Bioinformatics. 2007;8:287.
18. De la Calle G, García-Remesal M, Chiesa S, de la Iglesia D, Maojo V. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics. 2009;10:320.
19. Duck G, Stevens R, Robertson D, Nenadic G. Ambiguity and variability of database and software names in bioinformatics. In: Ananiadou S, Pyysalo S, Rebholz-Schuhmann D, Rinaldi F, Salakoski T, editors. Proc 5th Int Symp Semant Min Biomed; 2012. p. 2–9.
20. Kovačević A, Konjović Z, Milosavljević B, Nenadic G. Mining methodologies from NLP publications: a case study in automatic terminology recognition. Comput Speech Lang. 2012;26:105–26.
21. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–40.
22. The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012;40(Database issue):D71–5.
23. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
24. Home - PubMed - NCBI. https://www.ncbi.nlm.nih.gov/pubmed
25. Software - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Software
26. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
27. Sayers E, Wheeler D. Building customized data pipelines using the Entrez Programming Utilities (eUtils). In: NCBI Short Courses [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2004.
28. R Development Core Team. R: a language and environment for statistical computing. 2011.
29. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
30. Roberts RJ. PubMed Central: the GenBank of the published literature. Proc Natl Acad Sci U S A. 2001;98:381–2.
31. Kim J-D, Tsujii J. Corpora and their annotation. In: Ananiadou S, McNaught J, editors. Text Mining for Biology and Biomedicine. Boston and London: Artech House; 2006. p. 179–211.
32. Cunningham H, Maynard D, Bontcheva K, Tablan V, Aswani N, Roberts I, et al. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science; 2011. https://gate.ac.uk/books.html
33. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proc Eighteenth Int Conf Mach Learn. Morgan Kaufmann Publishers Inc.; 2001. p. 282–9.
34. Kovačević A, Dehghan A, Filannino M, Keane JA, Nenadic G. Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives. J Am Med Inform Assoc. 2013;20:859–66.
35. De Marneffe M-C, MacCartney B, Manning CD. Generating typed dependency parses from phrase structure parses. In: Proc 5th Int Conf Lang Resour Eval (LREC); 2006.
36. Klein D, Manning CD. Accurate unlexicalized parsing. In: Proc 41st Annu Meet Assoc Comput Linguist, vol. 1. Sapporo, Japan: Association for Computational Linguistics; 2003. p. 423–30.
37. CRF++. http://crfpp.sourceforge.net/
38. Porter Stemming Algorithm. http://tartarus.org/martin/PorterStemmer/
39. Torii M, Hu Z, Song M, Wu CH, Liu H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinformatics. 2007;8 Suppl 9:S5.
40. Free Phylogenetic Network Software. http://www.fluxus-engineering.com/sharenet.htm
41. Thornton K. libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics. 2003;19:2325–7.
42. Kevin’s Word List Page. http://wordlist.sourceforge.net/
43. Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics. 2006;22:2813–8.
44. Hearst MA. Automatic acquisition of hyponyms from large text corpora. In: Proc 14th Conf Comput Linguist, vol. 2. Morristown, NJ, USA: Association for Computational Linguistics; 1992. p. 539–45.
45. Southan C, Cameron G. Database Provider Survey. 2009. p. 1–58.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated