Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data
© Garcia Castro et al; licensee BioMed Central Ltd. 2013
Published: 15 April 2013
Volume 4 Supplement 1
The World Wide Web has become a dissemination platform for scientific and non-scientific publications. However, most of the information remains locked up in discrete documents that are not always interconnected or machine-readable. The connectivity tissue provided by RDF technology has not yet been widely used to support the generation of self-describing, machine-readable documents.
In this paper, we present our approach to the generation of self-describing machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services. We expose our model, services, prototype, and datasets at http://biotea.idiginfo.org/
The semantic processing of biomedical literature presented in this paper embeds documents within the Web of Data and facilitates the execution of concept-based queries against the entire digital library. Our approach delivers a flexible and adaptable set of tools for metadata enrichment and semantic processing of biomedical documents. Our model delivers a semantically rich and highly interconnected dataset with self-describing content so that software can make effective use of it.
For over 350 years, scientific publications have been fundamental to advancing science. Since the first scholarly journals, the Philosophical Transactions of the Royal Society (of London) and the Journal des Sçavans, scientific papers have been the primary, formal means by which scholars have communicated their work, e.g., hypotheses, methods, results, and experiments. Advances in technology have made it possible for the scientific article to adopt electronic dissemination channels, moving from paper-based journals to purely electronic formats. By the same token, scholarly communication has been complemented by the adoption of blogs, mailing lists, social networks, and other technologies that in combination support the tissue by means of which scholars communicate their work and establish connections with one another. However, in spite of these advances, scientific publications remain poorly connected to each other as well as to external resources. Furthermore, most of the information remains locked up in discrete documents without machine-processable content. Such interconnectedness and structuring would facilitate interoperability across documents as well as between publications and resources available online. Scholarly data and documents are of most value when they are interconnected rather than independent.
In an effort to add value to the content of scientific publications, publishers are actively improving programmatic access to their products. For instance, Nature Publishing Group (NPG) recently released 20 million Resource Description Framework (RDF) statements, including primary metadata for more than 450,000 articles published by NPG since 1869. In this first release, the dataset includes basic citation information (title, author, publication date, etc.), identifiers, and Medical Subject Headings (MeSH) terms. Their data model makes use of vocabularies such as the Bibliographic Ontology (BIBO), Dublin Core Metadata Initiative (DCMI) [4, 5], Friend of a Friend (FOAF) [6, 7], and the Publishing Requirements for Industry Standard Metadata (PRISM), as well as ontologies that are specific to NPG. Similarly, Elsevier provides an Application Programming Interface (API) that makes it possible for developers to build specialized applications.
In this paper, we present our knowledge model for biomedical literature. We aim at delivering interoperable, interlinked, and self-describing documents in the biomedical domain. We applied our approach to the full-text, open-access subset of PubMed Central (PMC). PMC is a free full-text archive of biomedical literature; currently, it includes 1,679 journals. PMC provides an open-access subset; articles in this subset are still protected by copyright but are also available under a Creative Commons license, i.e., more liberal redistribution is allowed. Articles are available as Extensible Markup Language (XML) files downloadable via File Transfer Protocol (FTP). In our approach, existing ontologies are brought together in order to facilitate the representation of sections in scientific literature as well as the identification of biologically meaningful fragments. These are pieces of text corresponding to proteins, chemicals, drugs, or diseases, among other biological concepts, within those previously identified sections. By delivering a semantic infrastructure for scientific publications, i.e., a semantic dataset, we are supporting interoperability, as publications are linked to each other and to biological resources. By embedding biomedical literature in the Web of Data (WoD), it is possible for users and developers to benefit from the advantages offered by the Linked Open Data (LOD) cloud.
Our RDFization process orchestrates ontologies such as the Document Components Ontology (DoCO), BIBO, DCMI [4, 5], and FOAF [6, 7]; these namespaces have been added to our SPARQL endpoint so that users do not need to define them as prefixes. Meaningful fragments within sections are automatically marked and enriched by adding annotations. Such annotations are structured with the Annotation Ontology (AO). In our model, we follow the four principles proposed by Tim Berners-Lee for publishing Linked Data: (i) using Uniform Resource Identifiers (URIs) to identify things, (ii) using Hypertext Transfer Protocol (HTTP) URIs to enable things to be referenced and looked up by software agents, (iii) representing things in RDF and providing a SPARQL endpoint, and (iv) providing links to external URIs in order to facilitate knowledge discovery.
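The bibliographic portion of such a model can be pictured as a handful of triples per article. The following sketch emits N-Triples using vocabularies named above; the article URI, the property selection, and all values are invented for illustration and do not come from the Biotea dataset.

```python
# Minimal sketch of the kind of bibliographic triples such a model produces
# for one article. The article URI and all values are invented examples.
ARTICLE = "http://example.org/biotea/PMC0000000"  # hypothetical article URI

def triple(s, p, o):
    """Serialize one triple in N-Triples syntax; o is treated as a URI if it
    starts with 'http', otherwise as a plain literal."""
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."

triples = [
    triple(ARTICLE, "http://purl.org/dc/terms/title", "An illustrative title"),
    triple(ARTICLE, "http://purl.org/ontology/bibo/pmid", "0000000"),
    triple(ARTICLE, "http://xmlns.com/foaf/0.1/maker", "http://example.org/authors/a1"),
]
print("\n".join(triples))
```

Because every subject is an HTTP URI, each of these statements already satisfies the first two Linked Data principles listed above.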
The connectivity tissue supported by our dataset makes it possible to establish networks of associated concepts across papers (NACAP). In this way, a retrieved set can be represented as a graph where nodes are articles and shared terms are edges. This graph-based navigation allows users to see how strongly two or more articles are interconnected as well as which terms they share; see the section “Gene-based search and retrieval, a first prototype” for a complete description of our current implementation. Our semantically enriched dataset also makes hierarchy-based searching possible. Based on the hierarchy of classes, retrieval can be widened to direct ascendants or narrowed to direct descendants; thus, the dataset can be navigated by going up or down in the hierarchy.
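The NACAP construction can be sketched in a few lines: given per-article sets of annotated concept identifiers (the sets below are invented examples), two articles are connected by an edge labeled with the concepts they share.

```python
from itertools import combinations

# Sketch of the NACAP graph: articles are nodes, and an edge exists between
# two articles whenever their annotation sets overlap, labeled with the
# shared terms. The annotation sets below are invented for illustration.
annotations = {
    "PMC1": {"CHEBI:15377", "GO:0006915", "MESH:D009369"},
    "PMC2": {"GO:0006915", "MESH:D009369"},
    "PMC3": {"CHEBI:15377"},
}

edges = {}
for a, b in combinations(sorted(annotations), 2):
    shared = annotations[a] & annotations[b]
    if shared:
        edges[(a, b)] = shared  # edge label: the concepts both articles mention
```

The size of each edge label gives a simple measure of how strongly two articles are interconnected.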
Retrieving articles based on content. Query expressed in natural language: retrieve the PubMed identifier, article title, section title, and paragraphs for those articles containing the term “cancer” in any section whose title includes “introduction”.
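One possible SPARQL rendering of this natural-language query, held here as a Python string. The graph pattern is an assumption based on the vocabularies the model reuses (DoCO for sections and paragraphs, CNT for text, BIBO for the PubMed identifier); the exact shape served by the live endpoint may differ.

```python
# Sketch of the content query as SPARQL; prefixes are predefined at the
# endpoint, as described earlier. The graph pattern is illustrative.
QUERY_BY_CONTENT = """
SELECT ?pmid ?articleTitle ?sectionTitle ?text
WHERE {
  ?article   bibo:pmid       ?pmid ;
             dcterms:title   ?articleTitle ;
             dcterms:hasPart ?section .
  ?section   a doco:Section ;
             dcterms:title   ?sectionTitle ;
             dcterms:hasPart ?paragraph .
  ?paragraph a doco:Paragraph ;
             cnt:chars       ?text .
  FILTER ( regex(?sectionTitle, "introduction", "i")
        && regex(?text, "cancer", "i") )
}
"""
```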
Retrieving articles based on annotations. Query expressed in natural language: retrieve the PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term “mixture” in any paragraph of the retrieved articles.
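A possible SPARQL rendering of the annotation query. The property names (ao:context, ao:body, ao:hasTopic) follow the Annotation Ontology, but the overall pattern is an assumption about how the annotations are attached, not the verified shape of the live endpoint.

```python
# Sketch of the annotation-based query as SPARQL; the graph pattern is an
# assumption based on the Annotation Ontology described in the text.
QUERY_BY_ANNOTATION = """
SELECT DISTINCT ?pmid
WHERE {
  ?article    bibo:pmid       ?pmid ;
              dcterms:hasPart ?section .
  ?section    dcterms:hasPart ?paragraph .
  ?annotation ao:context      ?paragraph ;
              ao:body         "mixture" ;
              ao:hasTopic     <http://purl.obolibrary.org/obo/CHEBI_60004> .
}
"""
```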
In addition to downloadable RDF documents, we also provide a web services API for the Biotea collection. The API is available at http://biotea.idiginfo.org/api and provides services for information retrieval using predefined indexes encapsulated behind a set of simple read-only Representational State Transfer (REST) services. The REST mechanisms facilitate retrieval of documents through the discovery of terms, topics, and vocabularies within the documents. The following specific services are available:
http://biotea.idiginfo.org/api/ → provides information about the API;
http://biotea.idiginfo.org/api/topics → queries the collection based on topics;
http://biotea.idiginfo.org/api/terms → queries the collection based on terms;
http://biotea.idiginfo.org/api/vocabularies → queries the collection based on vocabularies; and
http://biotea.idiginfo.org/api/documents → retrieves RDFized documents.
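Calling these read-only services amounts to building simple HTTP GET URLs. The helper below sketches this with the Python standard library; the query-parameter name is an assumption for illustration, since the bare /api/ service is the one that documents the real parameters.

```python
from urllib.parse import urlencode

BASE = "http://biotea.idiginfo.org/api"

def api_url(service="", **params):
    """Build a request URL for one of the read-only REST services listed
    above. Parameter names are illustrative assumptions; consult the API
    description service for the real ones."""
    url = f"{BASE}/{service}" if service else f"{BASE}/"
    return f"{url}?{urlencode(params)}" if params else url

# e.g., topics related to a term (hypothetical parameter name "term"):
print(api_url("topics", term="cancer"))
# prints: http://biotea.idiginfo.org/api/topics?term=cancer
```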
API retrieval support
A list of terms and their related topics
A list of topics and their related vocabularies
All topics related to a term
All vocabularies related to a term
All terms that start with a specific string (for autocompletion)
All topics related to a vocabulary
RDF of articles that include a term
Count of RDF of articles that include a term
A list of vocabularies and their prefixes
RDF of articles that include a vocabulary
Developers can make use of our API and downloadable dataset to create web applications and mash-ups. In order to illustrate some of the possibilities, we have developed a prototype application that makes it possible for users to search for human genes in PMC. Unlike search-and-retrieval tools that return only the target documents for a given search, our prototype returns an enriched visualization of the annotated terms and the biological entities related to them. Biomedical entities are fully identified, enabling the association of specific tools to enhance the reading experience, e.g., sequence browsers, 3D viewers, etc. Entities are also linked to resources such as Bio2RDF, thus immersing the content in the WoD.
Users initiate a search by typing the name of a gene. From the gene name, the corresponding protein accession is retrieved from the GeneWiki RDF. GeneWiki is a Wikipedia-based project comprising over 10,000 pages about human genes, and it includes mappings to proteins and diseases. Protein accessions in GeneWiki follow the Universal Protein Resource (UniProt) nomenclature. UniProt is a consortium that provides free access to a high-quality collection of protein sequences and functional annotations.
Unlike tools such as Reflect, our prototype makes use of BioJS components that are able to interact with each other. For instance, whenever the selection over a protein sequence changes, the interface highlights the corresponding amino acids in the 3D structure. As data types are fully identified, further manipulation becomes possible. In this way, we deliver a rich and interactive user experience.
Bio2RDF [18, 19] is a project that makes biomedical data available using Semantic Web technologies such as RDF and SPARQL. Bio2RDF brings together information from diverse public databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Protein Data Bank (PDB), UniProt, NCIt, and PubMed, amongst others. Both Bio2RDF and RDF4PMC aim to support biological knowledge discovery, the former by providing a single access point to several biomedical data sources, and the latter by delivering a semantically enriched information layer on top of PMC articles. Our RDFized PMC articles are semantically rich and deeply related to biomedical data sources available via Bio2RDF, and to the WoD at large, so ensuring that they are fully compliant with Bio2RDF is critical for bridging the gap between research articles and biomedical data sources. Mappings and processes will be available via the Bio2RDF GitHub repository.
We have generated a semantically enriched version of PMC. Our model makes extensive reuse of existing vocabularies. Annotations are scaffolded by using the AO, domain knowledge is identified by means of domain ontologies, and documents are structured by using DoCO, BIBO, DCMI Terms, and others. Some of the difficulties we encountered are: (i) at least four different formats are used to model references in PMC XML files, (ii) authors’ names are represented with initials and last name, making it difficult to disambiguate them, (iii) FOAF descriptions for authors and institutions are not provided, and (iv) annotation services were sometimes unavailable during processing; services are not always reliable. In order to deal with the different reference styles, we created specific methods for each one and transformed them into a common RDF model following the recommendations from BIBO. We also created FOAF elements for authors and institutions and assigned them resolvable URIs; in this way, it becomes possible to use tools such as http://sameas.org to define relationships between our FOAF descriptions and the WoD. When combined with social mechanisms, this technique may be used to disambiguate article contributors; authors could claim publications, so the FOAF descriptions could be consolidated. The problems with the annotation services were resolved by reprocessing files whenever needed. For the Gene Ontology (GO), the National Drug File – Reference Terminology (NDFRT), the Foundational Model of Anatomy (FMA), the Symptom Ontology (SYMP), and the International Classification of Diseases, version 10 (ICD10), it was necessary to update some of the term-related URIs and reprocess the annotations. GO was reprocessed with Whatizit; we are using the NCBO Annotator for FMA, NDFRT, SYMP, and ICD10.
We have generated an interoperable semantic dataset. Models such as that of NPG do not link to existing vocabularies, e.g., MeSH, in a semantic way. Instead, they include plain literals, making it difficult to use this information for knowledge discovery. Our model links to well-known vocabularies relevant in the biomedical domain. Similar to the NPG experience, we also rely on ontologies such as BIBO in order to model metadata. Since we are targeting only open-access documents within PMC, we also include the content of the documents. Similar to Reflect and UTOPIA, we integrate content from different resources into scientific publications; in our dataset, this is achieved by means of the semantic annotations added to the articles. In our case, this integration is persistent and interoperable with other resources.
Using ontologies to annotate concepts in scientific publications is a common practice; curators use MeSH to annotate PubMed. Finding hidden relations by using semantic annotations has been reported by several authors in the biomedical domain [34–36]. For instance, patterns across MeSH terms have been used to identify potential new associations between drugs and diseases. Also, annotations shared by a group of genes have helped to identify possible relationships between these genes [34, 35]. The entity recognition systems we are using, namely Whatizit and the NCBO Annotator, deliver annotations as HTTP URIs; in this way, annotated concepts can be referenced and searched by software. By structuring annotations in RDF, we are facilitating interoperability between the content and the WoD. For instance, it is possible to retrieve documents about gene expression in whole blood from cancer patients, focusing on the section “Materials and Methods”, and to complement the results with information about drugs and chemical entities that might act on a certain protein. The possibilities are limited only by the existing links in the LOD cloud and the availability of such resources via SPARQL endpoints.
Scientific articles are intrinsically related to one another via citations. It has been argued that two citing articles are similar to the extent that they cite the same literature; such is the nature of citation-based approaches (e.g., co-citation analysis, bibliographic coupling). Text-based approaches (e.g., term frequency-inverse document frequency (tf-idf), latent semantic analysis) can also be used to measure similarity across documents. Lewis suggests text similarity as an alternative search mechanism for the Medline database; articles are initially grouped by means of a keyword-based algorithm and then ordered following a similar algorithm based on sentence alignment. McSyBi utilizes clustering techniques in order to make sub-topics explicit from a set of citation data retrieved from PubMed; clusters are created based on information from the titles and abstracts. Users can modify the clusters by specifying a MeSH term or a UMLS Semantic Type; this makes it easier for users to obtain different graphs that can be analyzed from different perspectives depending on the terms used as secondary input. Unlike McSyBi, PuRed-MCL does not analyze the content of the article; instead, it relies on a set of pre-computed relations from PubMed. Those related documents are then used to build a graph that is processed by a clustering technique. Clusters and documents are annotated with MeSH terms and chemical substances; visualization is also supported. Most of the investigated approaches address the problem by using relatively small annotation graphs with few ontologies. As our dataset comprises annotations from over 20 ontologies, it provides an ideal playground for semantic similarity analysis across biomedical literature.
Our methods as well as the resulting dataset are an important part of the semantic infrastructure for PMC. We provide (i) the transformation of the original PMC files into RDF, (ii) the annotation of the RDF, and (iii) an API that makes the data available. New vocabularies as well as annotators can easily be plugged in, making it possible to enrich the semantics of the dataset and to support use cases not covered by the vocabularies and annotators initially used. Our approach is useful for both open-access and non-open-access collections; since the content is clearly identified and enriched with specialized vocabularies, publishers may decide what to expose via RDF and what content to make available.
To ensure the reproducibility of science, we envision that scientific articles will provide access to raw data and to computer-understandable descriptions of methodologies in order to support the recreation of the experiments being described. To aid in resolving inconsistencies, we expect in the future to relate and compare information across multiple documents. Semantic Web technologies should help to deliver a self-descriptive document that makes it possible to improve the user experience and change our understanding of scholarly communication. There should be a community-based platform providing FOAF for authors and institutions; such a platform could easily be part of publication submission systems. In this way, disambiguating authors will become much simpler.
We use BIBO, DCMI Terms, and the Provenance Ontology (PROV-O) to model the bibliographic metadata. BIBO provides classes and properties to represent citations and bibliographic references. BIBO can be used to model documents and citations in RDF or to classify documents within a hierarchy. BIBO reuses concepts from DCMI and PRISM, an XML specification that defines a controlled vocabulary for managing, aggregating, and publishing content; PRISM is also used in the RDF dataset provided by NPG. Dublin Core (DC) [4, 5] offers a domain-independent vocabulary to represent metadata; this vocabulary aims to facilitate cross-resource exploration. The DC vocabulary was initially released in 1995, and in 2008 the DCMI was created. It aims to provide an interoperable set of online metadata standards. PROV-O is a working draft from the World Wide Web Consortium (W3C); the W3C is the main international standards organization for the World Wide Web (WWW). PROV-O provides classes and properties to represent and interchange provenance data.
Some of the properties provided by BIBO are similar to those found in DCMI Terms; BIBO inherits some properties from DCMI Terms. We use bibo:pmid and bibo:doi as domain-specific publication identifiers. We have also included dcterms:identifier because, being a domain-independent property, it is widely used and facilitates compatibility with existing RDF datasets such as that from NPG. Some of the properties provided by PROV-O are similar to those available in DCMI Terms. For instance, prov:wasAttributedTo is similar to dcterms:creator, and prov:generatedAtTime is similar to dcterms:created. We have included both PROV-O and DCMI Terms properties. As in the previous case, PROV-O is more specific, whereas DCMI Terms is more widely used.
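The PROV-O / DCMI Terms overlap described above can be written down as a small lookup table, which a consumer could use to fall back from the more specific PROV-O property to its rough DCMI Terms counterpart; the two mappings shown are the ones named in the text.

```python
# PROV-O properties and their (rough) DCMI Terms counterparts, as described
# in the text. A consumer that only understands DCMI Terms could normalize
# incoming PROV-O statements with this table.
PROV_TO_DCTERMS = {
    "prov:wasAttributedTo": "dcterms:creator",
    "prov:generatedAtTime": "dcterms:created",
}

def as_dcterms(predicate):
    """Return the DCMI Terms counterpart of a PROV-O property, or the
    property unchanged if no mapping is known."""
    return PROV_TO_DCTERMS.get(predicate, predicate)
```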
In BIBO, authors are modeled either as rdf:List or rdf:Seq. In both cases, the range should be rdfs:Resource. By doing so, authors are modeled as resources rather than as plain text. We use FOAF to identify authors and their affiliations. Specifically, we use foaf:Person and foaf:Organization. FOAF provides a set of classes and properties to represent people and their connections to other people, organizations, and resources, e.g., publications. FOAF integrates information related to social networks. Such networking is also identifiable in publications; authors collaborate with co-authors and are affiliated with organizations. Authors could be represented as dcterms:Agent. However, we have used FOAF because it is more detailed and explicit. It includes elements such as first and last name, the institution to which the author belongs, personal homepage, and email account.
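An author record of this kind can be pictured as two linked resources. In this sketch the URIs, names, and the specific FOAF property spellings are illustrative assumptions, not values from the dataset.

```python
# Sketch of an author and an affiliation modeled as FOAF resources rather
# than plain text. All URIs and values are invented; property spellings are
# assumptions chosen for illustration.
author = {
    "@id": "http://example.org/authors/jdoe",   # hypothetical resolvable URI
    "@type": "foaf:Person",
    "foaf:givenname": "Jane",
    "foaf:family_name": "Doe",
    "foaf:mbox": "mailto:jdoe@example.org",
}
affiliation = {
    "@id": "http://example.org/orgs/example-university",
    "@type": "foaf:Organization",
    "foaf:name": "Example University",
}
# Link person to organization (the property choice here is an assumption):
author["foaf:workplaceHomepage"] = affiliation["@id"]
```

Because both resources carry resolvable URIs, services such as http://sameas.org can later relate them to equivalent resources in the WoD.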
We use DoCO to explicitly identify sections and paragraphs; CNT is used to represent the actual content of the paragraphs. DoCO provides a structured vocabulary, written in OWL 2 DL, of document components, both structural (e.g., block, inline, paragraph, section, chapter) and rhetorical (e.g., introduction, discussion, acknowledgements, reference list, figure, appendix). CNT is a working draft from the W3C; last released in May 2011, it aims to provide a flexible vocabulary for the representation of any type of content, e.g., text, XML, images, etc. It records the character encoding, making it easier for machines to process the content.
In order to identify biological terms, we use two text-mining tools: Whatizit [24, 25] and the NCBO Annotator. Both tools are based on exact string matching against pre-defined dictionaries. Whatizit is based on monq.jfa, an open-source Java library that binds regular expressions to actions; these actions are automatically executed whenever there is an exact string match between the dictionary and the processed text. In the case of Whatizit, an XML tag is added around the match. By doing so, relevant biological identifiers such as UniProt accessions and ChEBI and GO identifiers are added. The NCBO Annotator is based on Mgrep. Similar to Whatizit, the NCBO Annotator identifies terms and associates them with biological entities. However, the NCBO Annotator also takes advantage of the hierarchy in the vocabularies used for the association: it adds siblings and maps to equivalent terms in other ontologies.
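The exact-string-matching strategy both annotators rely on can be illustrated in a few lines: a dictionary maps surface forms to identifiers, a regular expression is built from the dictionary, and every hit is tagged in place, roughly the way Whatizit wraps matches in XML tags. The dictionary entries and the tag format here are toy examples, not the tools' real lexicons or output.

```python
import re

# Toy dictionary mapping surface forms (kept lowercase for lookup) to
# identifiers; these three entries are invented examples, not a real lexicon.
dictionary = {
    "p53": "uniprot:P04637",
    "apoptosis": "GO:0006915",
    "water": "CHEBI:15377",
}

# Longest-first alternation so longer terms win over their substrings.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(dictionary, key=len, reverse=True))) + r")\b",
    re.IGNORECASE,
)

def annotate(text):
    """Wrap each dictionary hit in an XML-like tag carrying its identifier."""
    return pattern.sub(
        lambda m: f'<e id="{dictionary[m.group(0).lower()]}">{m.group(0)}</e>',
        text,
    )

print(annotate("Water uptake precedes apoptosis."))
```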
The identified terms are represented in RDF following the model proposed by the AO. The AO facilitates the representation of annotations on static resources. It supports annotations expressed as plain text as well as those coming from ontologies. Annotations can be attached to a whole resource or to portions of it, e.g., sentences, paragraphs, sections, etc. Annotations on specific parts of a document make use of selectors; a selector identifies a portion within the text. Depending on its nature, it may be an aos:TextSelector, which identifies an exact match, or an aos:StartEndSelector, which records the initial and final positions of the annotation. The AO uses FOAF to identify annotators.
The NCBO Annotator is used with the following ontologies:
ChEBI for chemicals;
Pathway and Functional Genomics Data Society (MGED) ontologies for genes and proteins;
Master Drug Data Base (MDDB), NDDF, and NDFRT for drugs;
SNOMED, SYMP, MedDRA, MeSH, MedlinePlus Health Topics (MedlinePlus), Online Mendelian Inheritance in Man (OMIM), FMA, ICD10, and Ontology for Biomedical Investigations (OBI) for diseases and medical terms;
PO for plants; and
MeSH, SNOMED, and NCIt for general terms.
Whatizit is used for GO, UniProt proteins, UniProt Taxonomy, and diseases mapped to the UMLS; UniProt taxa are also mapped to the NCBI Taxon vocabulary. ChEBI, GO, and organisms are supported by both the NCBO Annotator and Whatizit. For ChEBI, we chose the NCBO Annotator because it is faster than Whatizit. For GO, we chose Whatizit because it allows the use of either the Foundry-compliant URIs or the OBO legacy ones. In order to better align our effort with Bio2RDF, we are using the OBO legacy URIs. For organisms, we chose Whatizit because it recognizes more organisms than the NCBO Annotator; for example, neither “human” nor “mouse” was recognized by the NCBO Annotator in any of our tests. Additionally, we included links to Bio2RDF for ChEBI, GO, MeSH, NCIt, UniProt proteins, UniProt Taxonomy, and NCBI Taxon.
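The distinction between URI styles can be made concrete for a single GO term. The URL templates below reflect common conventions for Foundry, OBO legacy, and Bio2RDF identifiers, but they are assumptions here; verify them against the actual dataset before relying on them.

```python
# Sketch of the three URI styles for one GO identifier; the templates are
# assumptions based on common conventions, not verified against the dataset.
def go_uris(go_id):
    """Return the Foundry-compliant, OBO legacy, and Bio2RDF URIs for a GO
    identifier such as 'GO:0006915'."""
    num = go_id.split(":", 1)[1]
    return {
        "foundry": f"http://purl.obolibrary.org/obo/GO_{num}",
        "legacy": f"http://purl.org/obo/owl/GO#GO_{num}",
        "bio2rdf": f"http://bio2rdf.org/go:{num}",
    }
```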
In order to specify the location of an annotated term in a document, we have extended the AO. Our extension makes it possible to select portions of text represented as RDF literals; the property whose object is the literal must be used only once in the annotated element. For example, the literal cnt:chars can only occur once per doco:Paragraph in the article (<a doco:Paragraph> cnt:chars <a literal>). The extensions are shown below:
aold:ElementSelector → identifies an exact text in a literal, e.g., cnt:chars, in an RDF element (extends aos:TextSelector); and
aold:StartEndElementSelector → like the previous one, but also includes the start and end positions of the snippet in the text (extends aos:StartEndSelector).
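How such a selector is populated can be sketched directly: find the term inside the paragraph's cnt:chars literal and record its offsets. The class name follows the extension described above, but the field names (exact, start, end) are illustrative assumptions about the extension's properties.

```python
# Sketch of populating an aold:StartEndElementSelector for a term inside a
# paragraph's cnt:chars literal. The field names are assumptions chosen for
# illustration; only the class name comes from the extension described above.
def build_selector(chars, term):
    """Return selector fields for the first occurrence of `term` inside the
    literal `chars`, or None if the term does not occur."""
    start = chars.find(term)
    if start == -1:
        return None
    return {
        "@type": "aold:StartEndElementSelector",
        "exact": term,
        "start": start,
        "end": start + len(term),  # end offset, exclusive
    }
```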
Application Programming Interface
Chemical Entities of Biological Interest
Representing Content in RDF 1.0
Dublin Core Metadata Initiative
DoCO, the Document Components Ontology
Digital Object Identifier
Foundational Model of Anatomy
Friend of a Friend Ontology
Google Web Toolkit
Hypertext Transfer Protocol
International Classification of Diseases, version 10
International Standard Serial Number
Kyoto Encyclopedia of Genes and Genomes
Living Document Project
Linked Open Data
Master Drug Data Base
Medical Dictionary for Regulatory Activities
MedlinePlus Health Topics
Medical Subject Headings
Functional Genomics Data Society
Networks of Associated Concepts Across Papers
National Center for Biotechnology Information
National Center for Biomedical Ontology
National Cancer Institute Thesaurus
National Drug Data File
National Drug File - Reference Terminology
Nature Publishing Group
Ontology for Biomedical Investigations
Online Mendelian Inheritance in Man
Protein Data Bank
Portable Document Format
Publishing Requirements for Industry Standard Metadata
Representational State Transfer
Resource Description Framework
Semantic Digital Libraries
Semantic Science Integrated Ontology
Systematized Nomenclature of Medicine
SPARQL Protocol and RDF Query Language
Symptom Ontology
Term frequency-inverse document frequency
Unified Medical Language System
Universal Protein Resource
Uniform Resource Identifier
Web of Data
Extensible Markup Language.
We thank John Gómez, the main developer of the gene-based search and retrieval prototype. We would also like to thank Jeremy Spinks and John Patterson for their technical support, and Diane Leiva and Melissa Carrion who proofread the final manuscript. We also thank the reviewers for their valuable comments. We would like to express our gratitude to Oscar Corcho, Robert Morris, and Dietrich Rebholz-Schuhmann for their comments on the manuscript. We would especially like to thank Michel Dumontier for his collaboration, suggestions, and hands-on work with the SIO mappings. Also, we thank Greg Riccardi for his continuous support. This work has been funded by US DoD MOMRP Grant w81xwh-10-2-0181.
AGC, CM, and the journal submission have been funded by US DoD MOMRP Grant w81xwh-10-2-0181. LJG has been self-funded. During the early stages of this project, 2010-2011, AGC was self-funded.
This article has been published as part of Journal of Biomedical Semantics Volume 4 Supplement 1, 2013: Proceedings of the Bio-Ontologies Special Interest Group 2012. The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/4/S1
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.