Establishing a distributed system for the simple representation and integration of diverse scientific assertions
© Samwald and Stenzhorn; licensee BioMed Central Ltd. 2010
Published: 22 June 2010
Skip to main content
Volume 1 Supplement 1
© Samwald and Stenzhorn; licensee BioMed Central Ltd. 2010
Published: 22 June 2010
Information technology has the potential to increase the pace of scientific progress by helping researchers in formulating, publishing and finding information. There are numerous projects that employ ontologies and Semantic Web technologies towards this goal. However, the number of applications that have found widespread use among biomedical researchers is still surprisingly small. In this paper we present the aTag (‘associative tags’) convention, which aims to drastically lower the entry barriers to the biomedical Semantic Web. aTags are short snippets of HTML+RDFa with embedded RDF/OWL based on the Semantically Interlinked Online Communities (SIOC) vocabulary and domain ontologies and taxonomies, such as the Open Biomedical Ontologies and DBpedia. The structure of aTags is very simple: a short piece of human-readable text that is ‘tagged’ with relevant ontological entities. This paper describes our efforts for seeding the creation of a viable ecosystem of datasets, tools and services around aTags.
Numerous biomedical datasets in aTag format and systems for the creation of aTags have been set-up and are described in this paper. Prototypes of some of these systems are accessible at http://hcls.deri.org/atag
The aTags convention enables the rapid development of diverse, integrated datasets and semantically interoperable applications. More work needs to be done to study the practicability of this approach in different use-case scenarios, and to encourage uptake of the convention by other groups.
A common challenge faced by biomedical researchers and clinicians is to quickly get an overview of publications and database entries for a certain biomedical topic, and to identify relevant, valid facts, research trends and contradictory findings from diverse sets of information sources. From here on, we will refer to this process as “key assertion integration”.
Novel information technologies might enable researchers to conduct these tasks in a more efficient and reliable manner. For example, Semantic Web technologies and ontologies hold the promise to enable the creation of smarter software systems that facilitate key assertion integration through better structuring of information, shared standards, clear semantics and global interlinking of data. These technologies have undergone remarkable progress in recent years. For instance, the Open Biomedical Ontologies (OBO) Foundry  and the BioPortal offered by the National Center for Biomedical Ontology (NCBO)  make a vast array of biomedical ontologies of great detail and high quality available to the public. Large, integrated biomedical knowledge bases have been created using the Resource Description Framework (RDF) and the Web Ontology Language (OWL), such as the Neurocommons Knowledge Base [3, 4] or the knowledge base  of the World Wide Web Consortium (W3C) Health Care and Life Science Interest Group (HCLSIG) . The Linked Data community  created a global network of RDF/OWL resources that consists of billions of statements, with each resource being available for fine-grained access based on established web standards.
However, there is still a widely recognized lack of applications that empower end-users to access, add and link to this structured, ontology-based information on the web. While many current biomedical RDF/OWL resources offer a wealth of valuable information, their structures are often highly complex and heterogeneous. Even though diverse datasets are interlinked and similarly formalized through RDF/OWL, the complexity and stylistic differences between them make it hard for developers to create applications that are both user-friendly and flexible enough to work with them without extensive customization .
In this paper we introduce a convention for using existing Semantic Web standards, vocabularies and ontologies in a way that is decidedly less complex and easier to implement in diverse software applications. We call this convention aTag (‘associative tag’).
The hypothesis that drives the aTag developments is that semi-structured data can be sufficient to tackle many realistic biomedical use-cases, and that simpler data structures are easier to integrate, understand and use than more complex ones. This hypothesis is inspired by experiences we made with creating, integrating and using large-scale ontology-based information repositories in recent years. In the case of aTags, a set of ontological entities is used to describe biomedical statements without capturing the actual relationships (RDF properties) between those entities.
In this paper we describe our efforts to implement this convention in a variety of web applications and try to demonstrate that this simpler approach can nonetheless be used to tackle realistic problems associated with the task of key assertion integration. A prototype of the software described here is accessible at http://hcls.deri.org/atag
aTags do not depend on any new vocabularies but are based on reusing the popular SIOC vocabulary  combined with entities from established ontologies and taxonomies, such as OBO ontologies or DBpedia [14, 15]. These entities are used to annotate short snippets of text containing assertions.
The number of ontological entities that can be used to ‘tag’ a specific aTag is not limited in any way. While the three tags used for the exemplary aTag shown above resemble a typical subject-predicate-object triple, the number of tags can also be smaller or larger, and the tags do not need to follow a subject-predicate-object pattern. This makes it possible to represent even complex statements without creating convoluted RDF graphs. For example, a statement such as “Protein A interacts with protein B in tissue C at time D” could be represented as an aTag that is tagged with five entities: ‘Protein A’, ‘molecular interaction’, ‘protein B’, ‘tissue C’, ‘time D’. Albeit not all relationships between the entities are explicitly encoded, a wide variety of queries could be answered with a dataset that is made up of aTags following this pattern. In comparison, trying to fully represent the statement above with explicit RDF triples (e.g., using the Basic Formal Ontology (BFO)  and the Relation Ontology ) would result in an RDF graph that would be quite complex and significantly more difficult to query.
Obviously, this type of representation cannot capture all the details of complex assertions. For example, a small interaction network containing more than two proteins that interact in more than one manner cannot be represented as an aTag without information loss. Such limitations to expressivity are consciously accepted, based on the rationale that in many scenarios the gain in utility by keeping things simple outweighs the loss in utility by limited expressivity.
The RDFa embedded within the HTML is also carried along when parts of web documents are copied and pasted to other locations, making it possible to re-arrange and merge datasets with the ease of re-arranging free text in a What-You-See-is-What-You-Get (WYSWYG) editor.
Currently available data sources in aTag format.
Source of entities used for tagging
Drug side-effect data. Size: 100,000 RDF statements.
PDSP Ki Database
Receptor-ligand interactions quantified by Ki value, emphasis on data about psychoactive substances.
PubMed Conclusion sections
Conclusion sections extracted from PubMed abstracts, annotated with Medical Subject Headings. Size: 1,8 million RDF statements.
SKOS version of Medical Subject Headings (MeSH)
Science Commons Text Annotation Service
Webservice that annotates sentences in PubMed abstracts with aTags
Gene Ontology, ChEBI, some smaller OBO ontologies, Uniprot and NCBI Taxonomy
User-generated content, mostly curated from scientific texts on the web, created with the aTag bookmarklet. Contains curated statements in diverse domains such as pharmacology and neuroscience.
Several hundreds of thousands of aTags have been extracted from conclusion sections of PubMed abstracts. The SIDER dataset contains aTags that were generated out of a structured database, SIDER , describing associations between drugs and side-effects that were reported for these drugs. aTags have also been derived from the Psychoactive Drugs Screening Programme (PDSP) Ki Database, which contains quantitative data about ligand-receptor interactions , primarily of psychoactive substances.
Preliminary results of using the aTag Explorer to answer realistic biomedical problems are encouraging, giving results of satisfying accuracy and coverage, even though the underlying datasets and text corpora are still very limited. The combination of aTags enriched with the background information drawn from taxonomy and ontology hierarchies makes it possible to search for scientific statements combining keyword search and exploratory search in an intuitive manner. To illustrate a typical search process with the aTag explorer, we can imagine a pharmaceutical researcher that wants to get a quick overview of products from the company Pfizer, including their types, mechanisms of action and side effects, and draw comparisons to drugs from competing pharmaceutical companies:
The user types “Pfizer” into the search text field an hits enter (normal keyword search).
The search results are more diverse than expected, including persons and institutions. The user selects ‘drug’ in the ‘broader tags’ facet to narrow down the result set (exploratory search).
The result set is updated and shows several drugs developed or manufactured by Pfizer. The ‘tags’ facet now contains a list of names of these drugs, while the ‘broader tags’ facet contains a list of drug classes drawn from the underlying taxonomies / ontologies (e.g., out of the drugs in the current result set, two are antibiotics, one is a statin, et cetera). In scenarios like this one, these lists of facet values do not only serve the purpose of narrowing down search results, but can provide valuable information in themselves.
Now, the users wants to retrieve more detailed information about a particular drug in the result set, Celecoxib, by clicking on the tag. All information in the knowledge base about Celecoxib is shown, including several reported side-effects for this medication. Celecoxib is classified as a ‘pyrazole’ in the ‘broader tags’ facet.
By applying the procedures of keyword search, inspecting ‘tags’ and ‘broader tags’, narrowing down on tags, shifting the focus by selecting related entities, and un-focusing the search by removing selected facet values, the researcher can now obtain information pyrazole drugs manufactured by other companies, and compare their side effects and indications.
Besides these general-purpose aTag search interfaces, there are also more specialized solutions under preparation. For example, Jun Zhao from the University of Oxford created a search interface for aTags specialized to herbal medicines .
The SWAN Scientific Discourse Ontology [31, 32] is currently applied in several biomedical projects. Its representation of scientific assertions resembles the aTag convention, but it also contains additional vocabulary, e.g., for the representation of agreement and disagreement between statements. SWAN has recently been mapped to SIOC as part of the work of the W3C Health Care and Life Science Interest group. We plan to make the developments around aTags compatible with SWAN and try to further interlink both efforts.
The aTags extracted from PubMed conclusion sections can be compared to a variety of sentence-based, whole-abstract search systems include I-HOP , Wikigenes  and MedEvi . While these systems provide far better coverage, the search results contain a lot of unwanted noise produced by statements derived from introduction, methods and results sections of abstracts, producing results that are often not very relevant, unintelligible outside of the context of the entire text, or very redundant (e.g., introduction sections of abstracts often re-iterate the same fact again and again). In comparison, the statements derived by extracting conclusion sections seem to contain far less noise and might provide much better user satisfaction, even though coverage is drastically lower. Further research is needed to characterize these differences in a more objective and quantifiable manner.
First experiences with creating interlinked datasets and interoperable applications based on the aTag convention were very positive. We are currently pursuing several threads of development, among them are:
Converting additional data from existing structured, biomedical databases into aTags.
Adding a ‘tag recommender’ function to the aTag generator by integrating the Open Biomedical Annotator web service  of NCBO.
Creating a mapping between OBO ontologies and DBpedia.
A long-term goal of this work is to drive the integration of aTags or similar RDFa-based means for key assertion integration into the scientific publication process. First advances towards the establishment of ‘structured digital abstracts’ as part of biomedical publications have already been undertaken, as exemplified by the FEBS Letters experiment . In this experiment, authors submitting papers to the molecular biology journal FEBS Letters were called on to submit a structured digital abstract alongside their manuscript, detailing molecular interactions reported in the paper in a machine-readable way. While this and other existing initiatives are important steps forward, they are still lacking interfaces that are easy to use for authors (leading to low compliance by users) and fail to use interlinked, standards-based and semantically rich formats (not realizing the full potential of the structured information). The use of simple, RDFa-based representation formats as outlined in this paper could help to drive the establishment of structured digital abstracts.
Readers that are interested in creating or consuming aTags to solve problems in their particular domains are encouraged to contact the authors of this paper.
Technical details about the tools and datasets are described below.
The aTags generator is based on an Ajax interface and server-side PHP code. Apache Solr  (which is based on Apache Lucene ) is used for the autocompletion of entity tags. Solr/Lucene allows for optimizing the relevance of suggested tags by elaborate ranking based on contextual cues and ontological structures.
aTags were created from existing structured databases through PHP scripts that processed dumps of the original databases.
The Science Commons text annotation service is a RESTful web service written in Python. When invoked, it submits input to the Whatizit SOAP web service hosted at the European Bioinformatics institute. The XML result retrieved from the Whatizit service is parsed, ontology identifiers are mapped to URIs of the RDF/OWL versions of the respective ontologies, and a HTML+RDFa page is generated.
In a sizeable fraction of PubMed abstracts, the narrative of the abstract is clearly delineated by explicit section headers (‘INTRODUCTION:’, ‘METHODS:’, ‘RESULTS:’, ‘CONCLUSIONS:’). A search in PubMed reveals that ~ 1,7 million abstracts contain the words ‘conclusion’ or ‘conclusions’ (out of a total of ~ 19 million citations indexed in PubMed). Most of these abstracts really do contain a clearly delineated conclusion section. This means that a huge corpus of biomedical abstracts with explicit conclusion sections exists, covering a broad area of knowledge domains.
aTags were generated from PubMed conclusion section by creating a PHP script that accomplished the following tasks:
1) Retrieve PubMed abstracts containing conclusion sections for a certain query. The script could process all ~1,7 million abstracts with explicit conclusion sections, but for this trial, a more restrictive query was chosen that retrieves abstracts about emotion and cognition:
("conclusion"[Title/Abstract] OR "conclusions"[Title/Abstract]) AND (antidepressant OR "Emotions"[Mesh] OR "Behavioral Symptoms"[Mesh] OR "Mood Disorders"[Mesh])
This yields 58.000 results. Note that removing the constraint for ‘conclusion’ or ‘conclusions’ in this query would increase the number of results to 430.000, which means that roughly 1/7th of the abstracts for this topic contain an explicit conclusion sections.
2) Abbreviations that are locally defined in each abstract are expanded to their long forms using the Schwartz & Hearst algorithm . In most abstracts, abbreviations are introduced in the introduction section, e.g.:
„ INTRODUCTION: Seasonal affective disorder (SAD) is common in ...“
while the conclusion sections contain lots of these abbreviated forms that tend to be unintelligible when only the conclusion sections are viewed in isolation, e.g.:
„ CONCLUSIONS: This study shows that SAD is effectively treated with ...“
The script recognizes local abbreviations and expands them, making the conclusion sections better intelligible. E.g., after processing the conclusion now reads
“CONCLUSIONS: This study shows that Seasonal affective disorder is effectively treated with …“
3) The conclusion sections are then extracted and turned into aTags serialized in Turtle format. For this trial, each aTag was annotated with the MeSH terms associated with the article. In future work, this could be enhanced or replaced with annotations created by automated entity recognition (e.g., BioPortal webservice or EBI Whatizit) or manual curation.
Thanks to all participants of the W3C Health Care and Life Science interest group. Special thanks to Kei Cheung, Tim Clark, Susie Stephens and M. Scott Marshall for the organization of the interest group. Thanks to Anita de Waard for organizing the HypER group. Thanks to Andreas Harth (DERI Galway) for setting up the Visinav system. The work presented in this paper has been funded in part by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by a postdoctoral fellowship from the Konrad Lorenz Institute for Evolution and Cognition Research, Austria.
This article has been published as part of Journal of Biomedical Semantics Volume 1 Supplement 1, 2010: Proceedings of the Bio-Ontologies Special Interest Group Meeting 2009: Knowledge in Biology. The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/1/S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.