Ambiguity and variability of database and software names in bioinformatics
© Duck et al. 2015
Received: 8 July 2013
Accepted: 5 June 2015
Published: 29 June 2015
Bioinformatics offers numerous databases and software tools for accomplishing any given task, but until recently there were no methods that could systematically identify mentions of these resources within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.
Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
Keywords: Bioinformatics, Computational biology, CRF, Dictionary, Resource extraction, Text-mining
Bioinformatics and computational biology rely on domain databases and software to support data collection, aggregation and analysis, and the use of these resources is typically reported in research papers, usually as part of the methods section. However, limited progress has been made in systematically capturing mentions of databases and tools in order to explore computational practice in bioinformatics on a large scale. An evaluation of the resources in use could help bioinformaticians to identify common usage patterns [1] and potentially infer scientific “best practice” [2] based on a measure of how often, or where, a particular resource is used within an in silico workflow [3]. Although there are several inventories that list available database and software resources (e.g., the NAR databases and web-services special issues [4, 5], ExPASy [6], the Online Bioinformatics Resources Collection [7], etc.), until recently, to the best of our knowledge, there were no attempts to systematically identify resource mentions in the literature [8].
In recent years, biomedical text mining has been widely used to identify mentions of entities of different types in the literature. Named entity recognition (NER) enables automated literature insights [9] and provides input to other text-mining applications. For example, within biology and bioinformatics, NER systems have been developed to capture species [10], proteins/genes [11–13], chemicals [14], etc. Naming inconsistencies, numerous synonyms and acronyms, the difficulty of distinguishing entity names from common words in natural language, and ambiguous definitions of concepts all make NER a difficult task [15, 16]. Still, for some applications, NER tools achieve relatively high precision and recall. For example, LINNAEUS achieved F-scores around the 95 % mark for species name recognition and disambiguation at the mention and document levels [10]. Gene names, on the other hand, are known for their ambiguity and variability, resulting in lower reported F-scores: ABNER [12] recorded an F-score of just under 73 % for strict-match gene name recognition (85 % with some boundary-error toleration), and GNAT [13] reported an F-score of 81 % for the same task (up to a maximum of 90 % for single-species gene name recognition, e.g., for yeast).
Some previous work exists on the automated identification and harvesting of bioinformatics database and software names from the literature. For example, OReFiL [17] uses mentions of Uniform Resource Locators (URLs) in text to recognise new resources and update its own internal index. Similarly, BIRI (BioInformatics Resource Inventory) uses a series of hand-crafted regular expressions to automatically capture resource names, their functionality and their classification from paper titles and abstracts [18]. The reported quality of the identification process was in line with other NER tasks: BIRI successfully extracted resource names in 94 % of cases in a test corpus consisting of 392 abstracts that matched a search for “bioinformatics resource” and eight documents manually included to test domain robustness. However, both of these tools focus on updates and bias their evaluation towards resource-rich text, which prevents a full understanding of false-negative errors in the general bioinformatics literature.
This paper analyses database and software name mentions in the bioinformatics and computational biology literature to assess the challenges for automated extraction. Building on our previous work [19], we analyse the variability and ambiguity of resource names using a set of 60 full-text documents manually annotated at the mention level, and compare dictionary and machine learning approaches to their identification based on results from an additional dataset of 25 full-text documents. Although we focus here on bioinformatics resources, the challenges and solutions encountered in database and software name recognition are generic, and thus not unique to this domain [20].
Corpus annotation and analysis
For the purpose of this study, we define a database as any electronic resource that stores records in a structured form and provides unique identifiers for each record. This includes any database, ontology, repository or classification resource. Examples include SCOP (a database of protein structural classification) [21], UniProt (a database of protein sequences and functional information) [22], the Gene Ontology (an ontology that describes gene product attributes) [23], PubMed (a repository of abstracts) [24], etc. We adopt Wikipedia’s definition of software [25]: “a collection of computer programs … that provides the instructions for telling a computer what to do and how to do it”, and use program and tool as synonyms for software. Examples include BLAST (automated sequence comparison) [26], eUtils (access to literature data) [27], etc. We also include mentions of web-services and package names (e.g., R packages from Bioconductor [28, 29]). We explicitly exclude database record numbers/identifiers (e.g., GO:0002474, Q8HWB0), file formats (e.g., PDF), programming languages and their libraries (e.g., Python, BioPython), operating systems (e.g., Linux), algorithms (e.g., Merge-Sort), methods (e.g., ANOVA, Random Forests) and approaches (e.g., machine learning, dynamic programming).
To explore the use of database and tool names, we developed an annotated set of 60 full-text articles from the PubMed Central open-access subset [30]. The articles were randomly selected from Genome Biology (5 articles), BMC Bioinformatics (36) and PLoS Computational Biology (19). These journals were selected because they provide a broad overview of the bioinformatics and computational biology domains.
The articles were primarily annotated by a bioinformatician (GD) with experience in text mining. The annotation process involved marking each database/software name mention; associated designators (e.g., words such as database or software) were included only if part of the official name (e.g., Gene Ontology). Inter-annotator agreement (IAA) [31] for the annotation of database and software names was calculated on five full-text articles randomly selected from the annotated corpus, which were independently annotated by a PhD student with a bioinformatics and text-mining background.
To assess the complexity, composition, variability and ambiguity of resource names, we analysed the annotated mentions. The corpus was pre-processed using a typical text-mining pipeline consisting of the tokeniser, sentence splitter and part-of-speech (POS) tagger from GATE’s ANNIE [32]. We analysed the length of names, their lexical (stemmed, token-level) and structural (POS-tag pattern) composition, and their variability and ambiguity compared with common English words, acronyms and abbreviations.
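As a rough illustration of this pre-processing stage, the sketch below assembles a comparable pipeline in Python, using NLTK as a stand-in for GATE’s ANNIE components (an assumption for illustration only; the study itself used ANNIE):

```python
# Minimal stand-in for the ANNIE-style pre-processing pipeline: sentence
# splitting, tokenisation, POS tagging and stemming. NLTK is used here purely
# for illustration (requires the "punkt" and "averaged_perceptron_tagger"
# resources, e.g., via nltk.download()).
import nltk
from nltk.stem import PorterStemmer

def preprocess(text):
    stemmer = PorterStemmer()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # One (token, POS tag, stem) triple per token.
        yield [(tok, pos, stemmer.stem(tok)) for tok, pos in tagged]

for sent in preprocess("We aligned all reads with BLAST. Hits were stored in UniProt."):
    print(sent)
```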
In addition to the dataset of 60 articles used for the analysis and development of the NER tools, a further 25 full-text papers were annotated to assess the quality of the proposed NER approaches (see below).
Dictionary-based approach (baseline)
Table 1 Sources from which the database and software name dictionary was compiled, including manually added entries, yielding our combined dictionary (DB, SW, PK)
Machine learning approach
Given the availability of the manually annotated corpus, a machine learning (ML) approach was explored for the identification of resource names. We approached the task as a sequence-tagging problem, as often adopted in NER systems, opting for Conditional Random Fields (CRF) [33] with token-level features that comprised each token’s own characteristics and those of its neighbouring tokens. We used the Beginning-Inside-Outside (B-I-O) annotation scheme.
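To make the B-I-O scheme concrete, the sketch below converts annotated character spans into token-level labels; the function and the example spans are our own illustration:

```python
# Hypothetical sketch: convert annotated character spans to B-I-O token labels.
def bio_labels(tokens, resource_spans):
    """tokens: [(text, start, end)]; resource_spans: [(start, end)] of annotated names."""
    labels = []
    for text, start, end in tokens:
        label = "O"
        for s, e in resource_spans:
            if start == s:                 # token opens an annotated name
                label = "B"
            elif s < start and end <= e:   # token continues inside the name
                label = "I"
        labels.append(label)
    return labels

# "Gene Ontology" annotated as one resource name:
tokens = [("We", 0, 2), ("used", 3, 7), ("Gene", 8, 12), ("Ontology", 13, 21)]
print(bio_labels(tokens, [(8, 21)]))   # -> ['O', 'O', 'B', 'I']
```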
1. Orthographic features captured the orthographic patterns associated with biomedical resource mentions. For example, a large percentage of mentions are acronyms (e.g., GO, SCOP), capitalised terms (e.g., Gene Ontology, Bioconductor) or words that combine capital and lower-case letters (e.g., MySQL, UniProt). We engineered two groups of orthographic features [34]. The first group comprised shape (pattern) features that mapped a given token to an abstract representation: each capital letter was replaced with “X”, each lower-case letter with “x”, each digit with “d” and any other character with “S”. Two features were created in this group: the first contained a mapping for each character in a token (e.g., MySQL was mapped to “XxXXX”); the second mapped a token to a four-character string indicating the presence of a capital letter, a lower-case letter, a digit or any other character (absence was marked with “_”), e.g., MySQL was mapped to “Xx__”. The features in the second group captured specific orthographic characteristics (e.g., is the token capitalised, does it consist only of capital letters, does it contain digits – see Table 2 for the full list), extracted by a set of regular expressions. A sketch of these mappings appears after this feature list.
Table 2 Token-specific orthographic features extracted by regular expressions:
- token is an acronym
- all the letters in the token are capitalised
- token is capitalised
- token contains at least one capital letter
- token contains at least one digit
- token is made up of digits only
2. Dictionary features were represented by a single binary feature that indicated whether the given token was contained within our biomedical resources dictionary.
3. Lexical features included the token itself, its lemma and its part-of-speech (POS) tag.
4. Syntactic features were extracted from the syntactic relations in which the phrase participated as a governor or a dependant, as returned by the Stanford parser [35, 36]; where several relations applied, the relation types were sorted alphabetically and concatenated (e.g., “pobj” and “advmod” were combined as “advmod_pobj”).
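The shape and presence-string mappings, together with a few of the Table 2 flags and the binary dictionary feature, can be reproduced as follows; the function names are ours and the regular expressions are illustrative rather than the exact ones used in the study:

```python
import re

def shape(token):
    # Per-character shape: capital -> "X", lower-case -> "x", digit -> "d", other -> "S".
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else "S" for c in token)

def char_classes(token):
    # Four-character presence string ("_" marks absence of a character class).
    return ("X" if any(c.isupper() for c in token) else "_") + \
           ("x" if any(c.islower() for c in token) else "_") + \
           ("d" if any(c.isdigit() for c in token) else "_") + \
           ("S" if any(not c.isalnum() for c in token) else "_")

def orthographic_flags(token, dictionary=frozenset()):
    # A few of the Table 2 features plus the binary dictionary feature.
    return {
        "all_caps":      bool(re.fullmatch(r"[A-Z]+", token)),
        "capitalised":   bool(re.fullmatch(r"[A-Z].*", token)),
        "has_digit":     bool(re.search(r"[0-9]", token)),
        "all_digits":    bool(re.fullmatch(r"[0-9]+", token)),
        "in_dictionary": token in dictionary,
    }

print(shape("MySQL"), char_classes("MySQL"))   # -> XxXXX Xx__
```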
Experiments on the training data revealed that a context window of two tokens before and one token after the current token provided the best performance. The CRF model was trained using CRF++ [37]. All pre-processing needed for feature extraction was provided by the same text-mining pipeline used for the corpus analysis and the dictionary-based approach.
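For concreteness, a CRF++ feature template expressing this window might look as follows; the column layout (with the token in column 0) and the exact feature combinations are our assumptions, not the template actually used:

```python
# Hypothetical CRF++ unigram template over a window of two tokens before and
# one token after the current token; column 0 of the training file is assumed
# to hold the token (other feature columns would be referenced analogously).
template = """\
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[-1,0]/%x[0,0]
B
"""
with open("template", "w") as f:
    f.write(template)

# Training and tagging would then use the standard CRF++ binaries:
#   crf_learn template train.data model
#   crf_test -m model test.data
```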
Machine learning – post-processing
An analysis of the initial CRF results on the development dataset revealed that a large portion of false negatives were resource mentions that the model recognised at least once in a document but missed elsewhere within the same document. We therefore designed a two-pass post-processing approach: the first pass collected and stored all the CRF tagging results, which were then used to re-label the tokens in the second pass. To avoid over-generation of labels (i.e., possible false positives), we created a set of conditions that each token had to meet to be re-labelled as a resource mention. First, across the entire corpus being tagged, it had to be labelled as (part of) a resource name in the first pass more often than not. If so, the candidate token also had to fulfil one of two further conditions: either it was contained within the biomedical resources dictionary, or it was an acronym at least two characters long containing no digits. Finally, the four tokens “analysis”, “genomes”, “cycle” and “cell” were never re-labelled as part of a resource name in the second pass, as they were found to be the source of a large percentage of false positives.
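A minimal sketch of this two-pass re-labelling follows; the B-I-O handling is simplified to single-token re-labelling, and the acronym test is our own approximation of the condition described above:

```python
from collections import Counter

STOP = {"analysis", "genomes", "cycle", "cell"}   # never re-labelled (see above)

def second_pass(tagged_corpus, dictionary):
    """tagged_corpus: [(token, label)] from the first CRF pass, label in {B, I, O}.
    Returns the corpus with eligible O-tokens re-labelled (a simplification:
    the real system re-labels mentions, not just single tokens)."""
    pos, total = Counter(), Counter()
    for token, label in tagged_corpus:
        total[token] += 1
        if label != "O":
            pos[token] += 1

    def eligible(token):
        if token in STOP:
            return False
        # Labelled as (part of) a resource name more often than not?
        if pos[token] <= total[token] - pos[token]:
            return False
        # In the dictionary, or a digit-free acronym of length >= 2.
        return token in dictionary or (token.isalpha() and token.isupper()
                                       and len(token) >= 2)

    return [(tok, "B" if lab == "O" and eligible(tok) else lab)
            for tok, lab in tagged_corpus]
```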
Standard text-mining performance statistics (precision, recall, F-score) were used for evaluation. In particular, we used 5-fold cross-validation across all 60 full-text articles for both the dictionary and machine learning approaches; for a fair comparison, the dictionary-based approach was evaluated only on the test partition of each fold, as it requires no prior “training”. We also tested both approaches directly on the held-out set of 25 articles without additional training or adjustment.
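For reference, the sketch below shows one way strict and lenient span matching can be scored; the lenient criterion here (any same-document overlap) is our simplification of boundary-error toleration:

```python
def overlaps(a, b):
    # Spans are (doc_id, start, end); they overlap if in the same document
    # and their character ranges intersect.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def prf(predicted, gold, lenient=False):
    if lenient:
        tp_p = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
        tp_g = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    else:
        tp_p = tp_g = len(set(predicted) & set(gold))
    p = tp_p / len(predicted) if predicted else 0.0
    r = tp_g / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```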
Results and discussion
Table 3 Statistics describing the manually annotated corpora: total number of documents; total database and software mentions; total unique resource mentions; percentage of database mentions (total and unique); average and maximum mentions per document (total and unique); and the number of resources with only a single lexicographic mention
In the development corpus, there were 401 lexically unique resources mentioned 2416 times (6 mentions on average per unique resource name), with an average of 40 resource mentions per document. The document with the most mentions had 227 resource mentions within it. Finally, 50 % of resource names were only mentioned once in the corpus. A similar profile was noted for the test corpus, although it contained notably more resource mentions per document.
Database and software name composition
We first analysed the composition of resource names both in the development corpus and dictionary. The longest database/software name in the annotated corpus contained ten tokens (i.e., Search Tool for the Retrieval of Interacting Genes/Proteins). However, there are longer examples in the dictionary (e.g., Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences).
Table 4 Internal POS structure of database and software names in the development corpus (example patterns include “NNP NNP NNP” and “NNP NNP NNP NNP”)
Variability of resource names
To evaluate the variability of resource names within our dictionary, we calculated the average number of name variants per resource: 1.13 at the dictionary level (6929 unique variants over 6126 resources, after adjustment for repeats). For the corpus analysis, we manually grouped the annotated mentions that referred to the same resource. Specifically, we grouped variants based on spelling errors and orthographic differences, and then paired long and short (acronym) forms based on our own background knowledge and the text from which they were extracted. Of the 401 lexically unique names, 97 were variants of other names, leaving 304 unique resources. In total, 231 resources (76 %) had only a single name variant within the corpus; 18 % had two variants, and the remaining 6 % had between three and five variants. Of the 97 name variants, 36 were acronyms, most of which were defined in the text (and so could perhaps be expanded automatically with available tools, e.g., [39]). However, there were other cases where a resource’s acronym was used without the expanded form being given (e.g., BLAST).
Ambiguity of resource names
As expected, a number of ambiguous resource names exist within the bioinformatics domain. Interesting examples include Network [40] (a tool enabling network inference from various biological datasets) and analysis [41] (a package for DNA sequence analysis). We therefore analysed our dictionary of database and software names to evaluate dictionary-level ambiguity against a full English-word dictionary derived from a publicly available list [42] (hereafter the “English dictionary”) and a biomedical acronym dictionary compiled from ADAM [43] (hereafter the “acronym dictionary”), consisting of 86,308 and 1933 terms, respectively. Using case-sensitive matching, 52 names matched English words (e.g., analysis, cycle, graph) and 77 names fully matched known acronyms (e.g., DIP: distal interphalangeal, but also the Database of Interacting Proteins). With case-insensitive matching, the number of matches rises to 534 against the English dictionary and 96 against the acronym dictionary.
To evaluate the recognition-level ambiguity within the annotated corpus, we also compared the annotated database and software names to the English dictionary and acronym dictionary. This resulted in four matches to the English dictionary (ACT, blast, dot, R), and six to the acronym dictionary (BBB, CMAP, DIP, IPA, MAS, VOCs) using case-sensitive matching. This equates to roughly 3 % of the unique annotated names. The total increases to 53 matches (17 %) if case-insensitive matching is used instead.
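The effect of case sensitivity reported above can be reproduced with a simple set comparison; the word lists in this sketch are tiny stand-ins for the full English and acronym dictionaries:

```python
def ambiguous(names, reference, case_sensitive=True):
    """Return resource names that collide with a reference vocabulary
    (e.g., an English word list or an acronym dictionary)."""
    if case_sensitive:
        return set(names) & set(reference)
    ref = {w.lower() for w in reference}
    return {n for n in names if n.lower() in ref}

english = {"analysis", "cycle", "graph", "blast", "network"}
resources = {"analysis", "BLAST", "Network", "UniProt"}
print(ambiguous(resources, english))                        # -> {'analysis'}
print(ambiguous(resources, english, case_sensitive=False))  # -> {'analysis', 'BLAST', 'Network'}
```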
Evaluation results on the development and test corpora
Table 5 Precision, recall and F-score on the development and test corpora for the dictionary baseline and for the CRF model with and without post-processing
Dictionary matching results on the development set
Machine learning results with post-processing on the development set
Machine learning results without post-processing on the development set
Combined dictionary and machine learning results on the development set
Feature impact analysis for the ML model
Table 6 Feature impact analysis of the machine learning model (without post-processing) on the development set, removing each feature group in turn: lexical, syntactic, orthographic and dictionary features
Overall, the lexical features were beneficial: removing this group caused a drop of 8 % in precision and 6 % in recall, resulting in a 7 % lower F-score. The syntactic features had only a slight impact: removing them cost 1 % in both precision and recall and 2 % in F-score. The orthographic features had a similar effect to the lexical features: removing them lost 8 % in precision and 6 % in recall, for a 7 % loss in F-score. Surprisingly, removing the dictionary features did not greatly degrade performance (an 8 % drop in precision, a 5 % drop in recall and thus a 6 % drop in F-score), suggesting that the ML model, even without the aid of a dictionary and with a relatively limited amount of training data, managed to capture a significant number of resource mentions.
Missed database and software mentions
Table 7 Types of textual patterns and clues for the identification of database and software names, and their contribution to total true positives (TPs): machine learning matches, heads and Hearst patterns [44], and references and URLs
Example clues and phrases appearing with specific heads or in Hearst patterns:
- “… the stochastic simulator Dizzy allows …”
- “The MethMarker software was …”
- “… tools: CLUSTALW, …, and MUSCLE.”
- “… programs such as Simlink, …, and SimPed.”
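Designator (“head”) words such as “software” or “simulator” next to a capitalised token, and Hearst-style enumerations [44] after plural designators, lend themselves to simple surface patterns. The sketch below is our own illustrative approximation, not the pattern set analysed in the paper:

```python
import re

# Head-noun clue: a designator word adjacent to a capitalised candidate name,
# e.g., "the stochastic simulator Dizzy", "The MethMarker software was ...".
HEAD_AFTER  = re.compile(r"\b(?:software|tool|program|database|simulator)\s+([A-Z][\w./+-]*)")
HEAD_BEFORE = re.compile(r"\b([A-Z][\w./+-]*)\s+(?:software|tool|program|database)\b")

# Hearst-style enumeration: "programs such as X, Y, and Z".
HEARST = re.compile(r"\b(?:tools|programs|databases)\s+such\s+as\s+([^.;]+)")

text = "We tested programs such as Simlink and SimPed."
m = HEARST.search(text)
if m:
    names = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(1))
    print([n.strip() for n in names])   # -> ['Simlink', 'SimPed']
```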
Example phrases from title appearances:
- “CoXpress: differential co-expression in gene expression data”
- “TABASCO: A single molecule, base-pair resolved gene expression simulator”
- “SimHap GUI: An intuitive graphical user interface for genetic association analysis”
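Title appearances of the form “NAME: description” are similarly regular; a hypothetical matcher:

```python
import re

# Title clue: "NAME: description", optionally with a second capitalised token
# (e.g., "SimHap GUI: ..."). Illustrative only.
TITLE = re.compile(r"^([A-Z][\w./+-]*(?:\s+[A-Z][\w./+-]*)?):\s")

for title in ["CoXpress: differential co-expression in gene expression data",
              "SimHap GUI: An intuitive graphical user interface for genetic association analysis"]:
    m = TITLE.match(title)
    if m:
        print(m.group(1))   # -> CoXpress, then SimHap GUI
```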
Another clue is that database and software mentions are frequently followed by either a reference or a web URL (e.g., “Galaxy and EpiGRAPH”, each followed by a citation; PMC2784320). This was the main indicator used by OReFiL [17]. We recognise, however, that web URLs and citations are not used only for resources, so this clue is far less reliable than the previous ones (for example, it would incorrectly capture “The learning metrics principle [14, 15]”; PMC272927). Restricting this clue to a paper’s Methods section may reduce the potential impact on precision.
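A sketch of this clue as a pattern (the example sentence and citation numbers are invented):

```python
import re

# Citation/URL clue: a capitalised name immediately followed by a bracketed
# citation or a parenthesised URL. As noted above, this over-generates
# (ordinary cited phrases match too), so it should be combined with other clues.
CITED = re.compile(r"\b([A-Z][\w./+-]*)\s*(?:\[\d+(?:,\s*\d+)*\]|\(https?://\S+\))")

s = "We used Galaxy [1] and EpiGRAPH [2] to pre-process the data."
print(CITED.findall(s))   # -> ['Galaxy', 'EpiGRAPH']
```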
Example versioning clues:
- “… using dot v1.10 and Graphviz 1.13(v16).”
- “CLUSTAL W version 1.83”
- “Dynalign 4.5, and LocARNA 0.99”
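Version expressions can likewise be approximated with a single pattern; this sketch is our own and is deliberately limited to capitalised names, so lower-case names such as “dot” would be missed:

```python
import re

# Versioning clue: a capitalised name followed by "v1.10", "version 1.83" or a
# bare dotted version number.
VERSION = re.compile(r"\b([A-Z][\w .+-]{0,30}?)\s*,?\s*(?:v(?:ersion)?\s*)?(\d+(?:\.\d+)+)")

for s in ["CLUSTAL W version 1.83", "Dynalign 4.5, and LocARNA 0.99"]:
    print(VERSION.findall(s))
# -> [('CLUSTAL W', '1.83')]
# -> [('Dynalign', '4.5'), ('LocARNA', '0.99')]
```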
Example expressions that functionally indicate database and software mentions:
- “… the SimHap GUI installation.”
- “… implemented within PedPhase …”
- “MethMarker therefore provides …”
- “A typical screenshot of MethMarker …”
- “Cofolga2 has six free parameters …”
- “MethMarker’s user interface reflects …”
- “MethMarker can directly import …”
- “xPedPhase thus needs cubic time …”
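Such functional contexts can be encoded as trigger patterns around a capitalised candidate name; an illustrative sketch with a deliberately small cue list:

```python
import re

NAME = r"([A-Z][\w./+-]*)"
# Functional-context cues (a small, illustrative subset).
CUES = [
    re.compile(r"\bimplemented\s+(?:within|in)\s+" + NAME),
    re.compile(r"\bscreenshot\s+of\s+" + NAME),
    re.compile(r"\bthe\s+" + NAME + r"\s+installation\b"),
    re.compile(NAME + r"\s+(?:provides|can\s+directly\s+import)\b"),
]

sentence = "A typical screenshot of MethMarker shows the configuration dialog."
for cue in CUES:
    m = cue.search(sentence)
    if m:
        print(m.group(1))   # -> MethMarker
```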
Examples of comparisons between database and software names:
- “… the numbers of breakpoint sites by xPedPhase were equal to the numbers of breakpoints by i Linker …”
- “xPedPhase did better than i Linker …”
- “Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz …”
- “Foldalign was much slower than Cofolga2 except for …”
- “Like Moleculizer, Tabasco dynamically generates …”
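Comparative constructions are useful because they yield two candidate names at once; an illustrative sketch:

```python
import re

NAME = r"([A-Z][\w./+-]*)"
# Comparative constructions pairing two candidate names, e.g.,
# "X did better than Y", "X was much slower than Y", "Like X, Y ...".
COMPARE = [
    re.compile(NAME + r"\s+did\s+better\s+than\s+" + NAME),
    re.compile(NAME + r"\s+was\s+(?:much\s+)?(?:slower|faster)\s+than\s+" + NAME),
    re.compile(r"\bLike\s+" + NAME + r",\s+" + NAME),
]

s = "Foldalign was much slower than Cofolga2 except for short sequences."
for pat in COMPARE:
    m = pat.search(s)
    if m:
        print(m.groups())   # -> ('Foldalign', 'Cofolga2')
```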
Example phrases with no clear or discriminative clues:
- “Additionally, i Linker has an error correction step that detects unlikely crossover events.”
- “In addition, Tabasco should be a good base to further study interactions on DNA …”
- “PSPE is not only able to use one of many common models of nucleotide substitution …”
- “The results show that LibSELDI tends to have a considerable advantage in the low FDR region …”
- “The structure of Tabasco confers at least four advantages.”
False positive filtering
Typical false positives returned by the CRF models include mentions of programming languages and their libraries (e.g., Python, BioPython), algorithms/methods (e.g., Euclidean – a distance measure; BLOSUM – a similarity scoring matrix), file formats (e.g., FASTA), and companies and organisations (e.g., EBI – the European Bioinformatics Institute). While we explicitly excluded these types from the current task, they can still be useful indicators of bioinformatics practice. Another large class of errors, as with the dictionary approach alone, involves matches of the GO sub-string within database identifiers (e.g., GO:0007089). Finally, ambiguous acronyms are typically returned as errors, but could be checked by searching for a definition within the document.
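Both of the latter error classes lend themselves to simple post-hoc filters; the sketch below illustrates the identifier check and the acronym-definition check (the length threshold and patterns are our assumptions):

```python
import re

def keep_mention(mention, following_text, document_text):
    # Drop matches like "GO" inside a record identifier such as "GO:0007089".
    if re.match(r":\d", following_text):
        return False
    # For short, ambiguous acronyms, require a parenthesised definition
    # somewhere in the document, e.g., "Database of Interacting Proteins (DIP)".
    if mention.isalpha() and mention.isupper() and len(mention) <= 4:
        return bool(re.search(r"\(\s*" + re.escape(mention) + r"\s*\)", document_text))
    return True

doc = "... the Database of Interacting Proteins (DIP) was queried; see GO:0007089 ..."
print(keep_mention("GO", ":0007089 ...", doc))    # -> False
print(keep_mention("DIP", " was queried", doc))   # -> True
```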
We note that there is not always a clear distinction between database and software names, methods, approaches, algorithms, programming languages, database records/identifiers, and file formats. We have chosen to focus on “executables” and datasets because our ultimate aim is to help reconstruct the bioinformatics workflow used within a given paper, so as to support experiment replication and reproduction. The problem arises because authors often introduce a novel algorithm together with an associated implementation (e.g., as a service or a stand-alone application), but frequently refer to their contribution only as an algorithm (or method) rather than as software, or vice versa. Although they discuss their algorithm throughout the paper, it could be argued that they are referring to their software implementation, especially when reporting benchmark improvements. This fuzzy boundary is a challenge for any focused automated system to overcome, although the distinction may not be relevant for some applications.
In this paper we presented an exploration of the variability and ambiguity of database and software mentions in the bioinformatics and computational biology literature. Our results suggest that database and software NER is a non-trivial task that requires more than a dictionary-matching approach, even when comprehensive resource inventories are used. Given bioinformatics’ focus on resource creation, no dictionary will ever be sufficiently comprehensive, making resource recognition potentially as hard as gene name recognition (in contrast to species recognition, a relatively stable domain). Names such as Network and analysis are sources of ambiguity, whereas acronyms and verbalised references to software such as “BLASTed” introduce variability that needs to be overcome.
The results of our ML-model show that dictionary-based predictions can be significantly improved. While ML achieved a major increase in precision, boosting recall proved to be challenging, indicating that additional attributes need to be included for accurate biomedical resource recognition.
Our analyses also identified a series of clues that could be picked up by text-mining techniques. As many of these clues are ambiguous on their own, one approach would be to combine several sources of evidence (e.g., via voting or thresholds) to capture database and software names more accurately (see, for example, [34]). Further work could combine such rules with the machine learning system to increase overall accuracy, perhaps recovering some of the lost recall.
Availability of supporting data
The datasets supporting the results of this article are available at: http://sourceforge.net/projects/bionerds/.
Abbreviations
BLAST: Basic local alignment search tool
CRF: Conditional random fields
NER: Named entity recognition
PDF: Portable document format
SCOP: Structural classification of proteins
Acknowledgements
We would like to thank Daniel Jamieson (University of Manchester) for his help in establishing the inter-annotator agreement. GD is funded by a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC) to RS, GN and DLR. The work of AK and GN is partially funded by projects III44006 (GN) and III47003 (AK and GN) of the Serbian Ministry of Education and Science. We also thank the authors of the sites listed in Table 1 for freely providing inventories of database and tool names.
The first version of this manuscript was presented at the Semantic Mining in Biology and Medicine (SMBM) 2012 symposium [19].
References
1. Duck G, Nenadic G, Brass A, Robertson DL, Stevens R. Extracting patterns of database and software usage from the bioinformatics literature. Bioinformatics. 2014;30:i601–8.
2. Eales JM, Pinney JW, Stevens RD, Robertson DL. Methodology capture: discriminating between the “best” and the rest of community practice. BMC Bioinformatics. 2008;9:359.
3. Stevens R, Glover K, Greenhalgh C, Jennings C, Pearce S, Li P, et al. Performing in silico experiments on the grid: a users perspective. In: Proc UK e-Science Program All Hands Meet; 2003. p. 43–50.
4. Brazas MD, Yim DS, Yamada JT, Ouellette BFF. The 2011 bioinformatics links directory update: more resources, tools and databases and features to empower the bioinformatics community. Nucleic Acids Res. 2011;39 Suppl 2:W3–7.
5. Galperin MY, Cochrane GR. The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2011;39(Database issue):D1–6.
6. ExPASy: SIB Bioinformatics Resource Portal. http://expasy.org/
7. Chen Y-B, Chattopadhyay A, Bergen P, Gadd C, Tannery N. The Online Bioinformatics Resources Collection at the University of Pittsburgh Health Sciences Library System – a one-stop gateway to online bioinformatics databases and software tools. Nucleic Acids Res. 2007;35(Database issue):D780–5.
8. Duck G, Nenadic G, Brass A, Robertson DL, Stevens R. bioNerDS: exploring bioinformatics’ database and software use through literature mining. BMC Bioinformatics. 2013;14:194.
9. Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB. Frontiers of biomedical text mining: current progress. Brief Bioinform. 2007;8:358–75.
10. Gerner M, Nenadic G, Bergman CM. LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics. 2010;11:85.
11. Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics. 2005;6 Suppl 1:S1.
12. Settles B. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191–2.
13. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G. Inter-species normalization of gene mentions with GNAT. Bioinformatics. 2008;24:i126–32.
14. Kolluru B, Hawizy L, Murray-Rust P, Tsujii J, Ananiadou S. Using workflows to explore and optimise named entity recognition for chemistry. PLoS One. 2011;6:e20181.
15. Dingare S, Nissim M, Finkel J, Manning C, Grover C. A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations. Comp Funct Genomics. 2005;6:77–85.
16. Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform. 2005;6:357–69.
17. Yamamoto Y, Takagi T. OReFiL: an online resource finder for life sciences. BMC Bioinformatics. 2007;8:287.
18. De la Calle G, García-Remesal M, Chiesa S, de la Iglesia D, Maojo V. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics. 2009;10:320.
19. Duck G, Stevens R, Robertson D, Nenadic G. Ambiguity and variability of database and software names in bioinformatics. In: Ananiadou S, Pyysalo S, Rebholz-Schuhmann D, Rinaldi F, Salakoski T, editors. Proc 5th Int Symp Semant Min Biomed; 2012. p. 2–9.
20. Kovačević A, Konjović Z, Milosavljević B, Nenadic G. Mining methodologies from NLP publications: a case study in automatic terminology recognition. Comput Speech Lang. 2012;26:105–26.
21. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–40.
22. The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012;40(Database issue):D71–5.
23. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
24. Home - PubMed - NCBI. https://www.ncbi.nlm.nih.gov/pubmed
25. Software - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Software
26. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
27. Sayers E, Wheeler D. Building customized data pipelines using the Entrez Programming Utilities (eUtils). In: NCBI Short Courses [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2004.
28. R Development Core Team. R: a language and environment for statistical computing. 2011.
29. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
30. Roberts RJ. PubMed Central: the GenBank of the published literature. Proc Natl Acad Sci U S A. 2001;98:381–2.
31. Kim J-D, Tsujii J. Corpora and their annotation. In: Ananiadou S, McNaught J, editors. Text Mining for Biology and Biomedicine. Boston and London: Artech House; 2006. p. 179–211.
32. Cunningham H, Maynard D, Bontcheva K, Tablan V, Aswani N, Roberts I, et al. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science; 2011. https://gate.ac.uk/books.html
33. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proc Eighteenth Int Conf Mach Learn. Morgan Kaufmann Publishers Inc.; 2001. p. 282–9.
34. Kovačević A, Dehghan A, Filannino M, Keane JA, Nenadic G. Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives. J Am Med Inform Assoc. 2013;20:859–66.
35. De Marneffe M-C, MacCartney B, Manning CD. Generating typed dependency parses from phrase structure parses. In: Proc 5th Int Conf Lang Resour Eval (LREC); 2006.
36. Klein D, Manning CD. Accurate unlexicalized parsing. In: Proc 41st Annu Meet Assoc Comput Linguist, vol. 1. Sapporo, Japan: Association for Computational Linguistics; 2003. p. 423–30.
37. CRF++. http://crfpp.sourceforge.net/
38. Porter Stemming Algorithm. http://tartarus.org/martin/PorterStemmer/
39. Torii M, Hu Z, Song M, Wu CH, Liu H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinformatics. 2007;8 Suppl 9:S5.
40. Free Phylogenetic Network Software. http://www.fluxus-engineering.com/sharenet.htm
41. Thornton K. libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics. 2003;19:2325–7.
42. Kevin’s Word List Page. http://wordlist.sourceforge.net/
43. Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics. 2006;22:2813–8.
44. Hearst MA. Automatic acquisition of hyponyms from large text corpora. In: Proc 14th Conf Comput Linguist, vol. 2. Morristown, NJ, USA: Association for Computational Linguistics; 1992. p. 539–45.
45. Southan C, Cameron G. Database Provider Survey. 2009. p. 1–58.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated