BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains

Katayama, Toshiaki; Wilkinson, Mark D; Aoki-Kinoshita, Kiyoko F; Kawashima, Shuichi; Yamamoto, Yasunori; Yamaguchi, Atsuko; Okamoto, Shinobu; Kawano, Shin; Kim, Jin-Dong; Wang, Yue; Wu, Hongyan; Kano, Yoshinobu; Ono, Hiromasa; Bono, Hidemasa; Kocbek, Simon; Aerts, Jan; Akune, Yukie; Antezana, Erick; Arakawa, Kazuharu; Aranda, Bruno; Baran, Joachim; Bolleman, Jerven; Bonnal, Raoul JP; Buttigieg, Pier Luigi; Campbell, Matthew P; Chen, Yi-an; Chiba, Hirokazu; Cock, Peter JA; Cohen, K Bretonnel; Constantin, Alexandru; Duck, Geraint; Dumontier, Michel; Fujisawa, Takatomo; Fujiwara, Toyofumi; Goto, Naohisa; Hoehndorf, Robert; Igarashi, Yoshinobu; Itaya, Hidetoshi; Ito, Maori; Iwasaki, Wataru; Kalaš, Matúš; Katoda, Takeo; Kim, Taehong; Kokubu, Anna; Komiyama, Yusuke; Kotera, Masaaki; Laibe, Camille; Lapp, Hilmar; Lütteke, Thomas; Marshall, M Scott; Mori, Takaaki; Mori, Hiroshi; Morita, Mizuki; Murakami, Katsuhiko; Nakao, Mitsuteru; Narimatsu, Hisashi; Nishide, Hiroyo; Nishimura, Yosuke; Nystrom-Persson, Johan; Ogishima, Soichi; Okamura, Yasunobu; Okuda, Shujiro; Oshita, Kazuki; Packer, Nicki H; Prins, Pjotr; Ranzinger, Rene; Rocca-Serra, Philippe; Sansone, Susanna; Sawaki, Hiromichi; Shin, Sung-Ho; Splendiani, Andrea; Strozzi, Francesco; Tadaka, Shu; Toukach, Philip; Uchiyama, Ikuo; Umezaki, Masahito; Vos, Rutger; Whetzel, Patricia L; Yamada, Issaku; Yamasaki, Chisato; Yamashita, Riu; York, William S; Zmasek, Christian M; Kawamoto, Shoko; Takagi, Toshihisa

doi:10.1186/2041-1480-5-5

Review
Open access
Published: 05 February 2014

Journal of Biomedical Semantics volume 5, Article number: 5 (2014)

Abstract

The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.

Introduction

In life sciences, the Semantic Web is an enabling technology which could significantly improve the quality and effectiveness of the integration of heterogeneous biomedical resources. The first wave of life science Semantic Web publishing focused on availability - exposing data as RDF without significant consideration for the quality of the data or the adequacy or accuracy of the RDF model used. This allowed a proliferation of proof-of-concept projects that highlighted the potential of Semantic technologies. However, now that we are entering a phase of adoption of Semantic Web technologies in research, quality of data publication must become a serious consideration. This is a prerequisite for the development of translational research and for achieving ambitious goals such as personalized medicine.

While Semantic technologies, in and of themselves, do not fully solve the interoperability and integration problem, they provide a framework within which interoperability is dramatically facilitated by requiring fewer pre-coordinated agreements between participants and enabling unanticipated post hoc integration of their resources. Nevertheless, certain choices must be made, in a harmonized manner, to maximize interoperability. The yearly BioHackathon series [1–3] of events attempts to provide the environment within which these choices can be explored, evaluated, and then implemented on a collaborative and community-guided basis. These BioHackathons were hosted by the National Bioscience Database Center (NBDC) [4] and the Database Center for Life Science (DBCLS) [5] as a part of the Integrated Database Project to integrate life science databases in Japan. In order to take advantage of the latest technologies for the integration of heterogeneous life science data, researchers and developers from around the world were invited to these hackathons.

This paper contains an overview of the activities and outcomes of two highly interrelated BioHackathon events which took place in 2011 [6] and 2012 [7]. The themes of these two events focused on representation, publication, and exploration of bioinformatics data and tools using standards and guidelines set out by the Linked Data and Semantic Web initiatives.

Review

Semantic Web technologies are formalized as World Wide Web Consortium (W3C) standards aimed at creating general-purpose, long-lived data representation, exchange, and integration formats that replace current ad hoc solutions. However, because they are general-purpose standards, many issues need to be addressed and agreed upon by the community in order to apply them successfully to the integration and interoperability problems of the life science domain. Therefore, participants of the BioHackathons fall into sub-groups of interest within the life sciences, representing the specific needs and strengths of their individual communities within the broader context of life science informatics. Though there were multiple specific activity groups under each of the following headings, and there was overlap and cross-talk between the activities of each group, we organize this review under five general categories: RDF data, Ontology, Metadata, Platforms and Applications (Figure 1). Results and issues raised by each group are briefly summarized in Table 1. We also note that many groups have published or will publish their respective outcomes in individual publications.

Table 1 Summary of investigated issues and results covered during BioHackathons 2011 and 2012

RDF data

In terms of RDF data generation, data were generated for genomic and glycomic databases (domain-specific models) and from the literature using text processing technologies. We describe these two subcategories here.

Domain specific models

Genome and proteome data Due to the high-throughput generation of genomic data, it is of high priority to generate RDF models for both nucleotide sequence annotations and amino acid sequence annotations. Up to now, nucleotide sequence annotations have been provided in a variety of formats such as the International Nucleotide Sequence Database Collaboration (INSDC) [8], Generic Feature Format (GFF) [9] and Genome Variation Format (GVF) [10]. By RDFizing this information, all of the annotations from various sequencing projects can be integrated in a straightforward manner. This would in turn accommodate the data integration requirements of the H-InvDB [11]. In general, due to the large variety of genomic annotations possible, it was decided that in the first iteration of a genomic RDF model, opaque Universally Unique IDentifiers (UUIDs) are to be used to represent sequence features. Each UUID would then be typed with a term from an appropriate ontology, such as the Sequence Ontology (SO), and sequence location would be specified using the Feature Annotation Location Description Ontology (FALDO) [12, 13]. FALDO was newly developed at the BioHackathon 2012 by representatives of UniProt [14], DDBJ [15] and genome scientists for the purpose of generically locating regions on biological sequences (e.g., modification sites on a protein sequence, fuzzy promoter locations on a DNA sequence etc.). A locally-defined vocabulary was used to annotate other aspects such as sequence version and synonymy. Thus, a generic system for nucleotide and amino acid sequence annotations could be proposed. Converters that output compatible RDF documents were also developed, including those for HMMER3 [16], GenBank/DDBJ [17] and GTF [18], as well as GFF2OWL [19].
The RDF output for Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) [20], a tool to retrieve molecular interaction data from multiple repositories with more than 150 million interactions available at the time of writing, was modified during the BioHackathon 2011 to improve the mapping of identifiers and ontologies. Identifiers.org was chosen as the provider for the new IRIs for the interacting proteins and ontology terms to allow a better integration with other sources. PSICQUIC RDF output is based on the popular BioPAX format [21] for interactions and pathways.
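The feature model described above (an opaque UUID, typed with an SO term and located with FALDO) can be sketched as follows. This is a minimal illustration rather than the model agreed at the BioHackathon: the helper name `feature_to_turtle` and the example.org reference sequence IRI are assumptions, and only a small subset of FALDO (exact begin and end positions of a region) is used.

```python
import uuid

PREFIXES = {
    "faldo": "http://biohackathon.org/resource/faldo#",
    "obo": "http://purl.obolibrary.org/obo/",
}


def feature_to_turtle(so_term, reference_iri, begin, end, feature_id=None):
    """Describe one sequence feature as Turtle (hypothetical helper).

    The feature is an opaque urn:uuid resource, typed with an SO class and
    located on a reference sequence through a faldo:Region.
    """
    feature = f"<urn:uuid:{feature_id or uuid.uuid4()}>"
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines += [
        "",
        f"{feature} a obo:{so_term} ;",
        "    faldo:location [",
        "        a faldo:Region ;",
        f"        faldo:begin [ a faldo:ExactPosition ; faldo:position {begin} ; "
        f"faldo:reference <{reference_iri}> ] ;",
        f"        faldo:end [ a faldo:ExactPosition ; faldo:position {end} ; "
        f"faldo:reference <{reference_iri}> ]",
        "    ] .",
    ]
    return "\n".join(lines)


# SO_0000704 is the Sequence Ontology class "gene"; the reference IRI is made up.
print(feature_to_turtle("SO_0000704", "http://example.org/sequence/chr1", 1000, 2000))
```

Because the feature identifier carries no meaning of its own, everything a consumer needs (type, position, reference sequence) has to be stated as explicit triples, which is what keeps features from different sequencing projects comparable.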

Glycome data The Glycomics working group consisted of developers from the major glycomics databases including Bacterial Carbohydrate Structure Database (BCSDB) [22], GlycomeDB [23, 24], GLYCOSCIENCES.de [25], Japan Consortium for Glycobiology and Glycotechnology Database (JCGGDB) [26], MonosaccharideDB [27], Resource for INformatics of Glycomes at Soka (RINGS) [28], and UniCarbKB [29]. These databases contain information about glycan structures, or complex carbohydrates, which are often covalently linked to proteins forming glycoproteins. The connections between glycomics and proteomics databases are required to accurately describe the properties and potential biological functions of glycoproteins. In order to establish such a connection this working group cooperated with UniProt developers present at the BioHackathon to agree upon and develop a standard RDF representation for carbohydrate structures, along with the relevant biological and bibliographic annotations and experimental evidence. Data from the individual databases have been exported in the newly developed RDF format (version 0.1) and stored in a triple store, allowing for cross-database queries. Several proof-of-concept queries were tested to show that federated queries could be made across multiple databases to demonstrate the potential for this technology in glycomics research. For example, both UniProt and JCGGDB are important databases in their respective domains of protein sequences and glycomics data. Moreover, UniCarbKB is becoming an important glycomics resource as well. However, since UniCarbKB is not linked with JCGGDB, a SPARQL query was written to find the JCGGDB entries corresponding to each UniCarbKB entry. As described by Aoki-Kinoshita et al. [30], this was made possible by the integration of UniCarbKB, JCGGDB and GlycomeDB data, with GlycomeDB serving as the link between the former two datasets.
This would not have been possible without agreement upon the standardization of the pertinent glycomics data in each database, discussed at BioHackathons.
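The linking step above amounts to a join on a shared intermediate record. The sketch below mimics that join in plain Python over a handful of made-up triples; the identifiers, the `ex:sameStructureAs` predicate and the function name `link_via_hub` are all hypothetical and do not reflect the actual glycan RDF format or the published query.

```python
# All identifiers and the predicate below are invented for illustration.
SAME_STRUCTURE = "ex:sameStructureAs"

TRIPLES = [
    ("unicarbkb:100", SAME_STRUCTURE, "glycomedb:7"),
    ("jcggdb:a1", SAME_STRUCTURE, "glycomedb:7"),
    ("unicarbkb:101", SAME_STRUCTURE, "glycomedb:8"),
    ("jcggdb:a2", SAME_STRUCTURE, "glycomedb:9"),
]


def link_via_hub(triples, source_prefix, target_prefix, predicate=SAME_STRUCTURE):
    """Pair source and target entries that point at the same hub entry.

    This is the join a SPARQL basic graph pattern of the following shape
    would express:
        ?source ex:sameStructureAs ?hub .
        ?target ex:sameStructureAs ?hub .
    """
    targets_by_hub = {}
    for subject, pred, hub in triples:
        if pred == predicate and subject.startswith(target_prefix):
            targets_by_hub.setdefault(hub, []).append(subject)

    pairs = []
    for subject, pred, hub in triples:
        if pred == predicate and subject.startswith(source_prefix):
            for target in targets_by_hub.get(hub, []):
                pairs.append((subject, target))
    return pairs


print(link_via_hub(TRIPLES, "unicarbkb:", "jcggdb:"))
```

Only entries whose two databases agree on the same hub record are paired, which is why a shared, standardized description of the structures is a precondition for the query.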

Text processing

The Data Mining and Natural Language Processing (NLP) groups focused their efforts in two primary domains: information extraction from scientific text - particularly from PDF articles - in the form of ontology-grounded triples, and the conversion of natural language questions into triples and/or SPARQL queries. Both of these were pursued with an eye to standardization and interoperability between life science databases.

Text extraction from PDF and metadata retrieval The first step in information extraction is ensuring that accurate plain-text representations of scientific documents are available. A widely recognized “choke point” that inhibits the processing and mining of vast biomedical document stores has been the fact that the bulk of information within them is often available only as PDF-formatted documents. Access to this information is crucial for a variety of needs, including accessibility to model organism database curators and the population of RDF triple stores. In confronting this issue, the BioHackers worked on a novel software project called PDFX [31, 32], which automatically converts PDF scientific articles to XML. The general use case was to include PDFX as a pre-processing step within a wide variety of more involved processing pipelines, such as those of the BioHackathon data mining and NLP groups presented next. Once text has been extracted from PDF documents, it also becomes necessary to retrieve the relevant metadata. This was done using DBCLS’s TogoDoc [33] literature management and recommendation system, which detects the Digital Object Identifier (DOI) or PubMed identifiers of PDF submissions in order to retrieve metadata such as MeSH terms and make recommendations to users.

Named entity recognition and RDF generation Once text is in processable form, the next phase of information extraction is entity recognition within the text. The field of gene name extraction suffers from a prevalence of diverse annotation schemata, ontologies, definitions of semantic classes, and standards regarding where the edges of gene names should be marked within a corpus (an annotated collection of topic-specific text). In 2011, the NLP/text mining group worked on an application for combining, viewing and editing the outputs of a variety of gene-mention-detection systems, with the goal of providing RDF outputs of protein/gene annotation tools such as GNAT [34], GeneTUKit [35], and BANNER [36]. The Annotation Ontology was used to represent these metadata. However, at the 2012 event, the SIO ontology [37] was extended to enable representation of entity-recognition outputs directly in RDF: resources were described in terms of a number of novel relation types (properties) and incorporated in an inheritance and partonymy hierarchy. Using these various components as a proof of concept, the NLP sub-group began developing a generic RDFization framework, BioInterchange [38], comprised of three pipelined steps - data deserialization, object model generation, and RDF serialization - to enable easy data conversion into RDF with automatic ontological mappings primarily to SIO and secondarily to other ontologies.
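The three pipelined steps named above (data deserialization, object model generation, RDF serialization) can be illustrated with a small sketch. It is not BioInterchange code and does not use its API: the function names, the `Mention` class, the tab-separated input layout and the example.org vocabulary are assumptions, chosen so that no SIO terms have to be guessed.

```python
import csv
import io
from dataclasses import dataclass

BASE = "http://example.org/"  # hypothetical namespace, not SIO
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass
class Mention:
    doc_id: str
    start: int
    end: int
    text: str


def deserialize(tsv_text):
    """Step 1: parse tab-separated tool output into plain dict rows."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def build_model(rows):
    """Step 2: turn raw rows into typed objects."""
    return [Mention(r["doc"], int(r["start"]), int(r["end"]), r["text"]) for r in rows]


def escape(literal):
    """Escape the characters that are special inside an N-Triples string."""
    return literal.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def serialize(mentions):
    """Step 3: emit one block of N-Triples per mention."""
    lines = []
    for index, m in enumerate(mentions):
        subject = f"<{BASE}mention/{index}>"
        lines.append(f"{subject} <{BASE}vocab#inDocument> <{BASE}doc/{m.doc_id}> .")
        lines.append(f'{subject} <{BASE}vocab#start> "{m.start}"^^<{XSD_INTEGER}> .')
        lines.append(f'{subject} <{BASE}vocab#end> "{m.end}"^^<{XSD_INTEGER}> .')
        lines.append(f'{subject} <{BASE}vocab#text> "{escape(m.text)}" .')
    return "\n".join(lines)


sample = "doc\tstart\tend\ttext\nd1\t10\t14\tTP53\n"
print(serialize(build_model(deserialize(sample))))
```

Keeping the object model between the reader and the writer is the point of the design: a new input format only needs a new step 1, and a different target ontology only needs a new step 3.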

Natural language query conversion to SPARQL The final activity within the NLP theme was the conversion of natural language queries to SPARQL queries. SPARQL queries are a natural interface to RDF triple-store endpoints, but they remain challenging to construct, even for those with intimate knowledge of the target data schema. It would be easier, for example, to enable users to ask a question such as “What is the sequence length for human TP53?” and receive an answer from the UniProt database, based on a SPARQL query that the system constructs automatically. A pre-existing tool from the DBCLS that can accomplish natural-language-to-SPARQL conversion was targeted and customized for the SNOMED-CT [39] dataset in BioPortal [40]. A large set of natural language test queries were developed, and for a subset of those queries the post-conversion output was analyzed and compared to a manually created gold standard output; subsequently, the group undertook a linguistic analysis of what conversions would have to be carried out in order to transform the current system output to the gold standard. These efforts included using natural language generation technology to build a Python solution that generates hundreds of morphological and syntactic variants of various natural language question types.
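To give a feel for how question variants of this kind can be produced, the sketch below expands a few hand-written templates into many surface forms of one question. It is an assumption-laden toy, not the group's generator: the template lists and the function name `question_variants` are invented, and it covers only simple syntactic variation, not morphology.

```python
from itertools import product

# Hypothetical building blocks for one question type.
OPENERS = ["What is", "Show me", "Please give me"]
ATTRIBUTES = ["the sequence length", "the length of the sequence"]
LINKERS = ["for", "of"]


def question_variants(organism, gene):
    """Expand the templates into every combination of opener, attribute,
    linker and entity phrasing for the given organism and gene."""
    entities = [
        f"{organism} {gene}",
        f"{gene} in {organism}",
        f"the {organism} {gene} protein",
    ]
    variants = []
    for opener, attribute, linker, entity in product(OPENERS, ATTRIBUTES, LINKERS, entities):
        ending = "?" if opener.startswith("What") else "."
        variants.append(f"{opener} {attribute} {linker} {entity}{ending}")
    return variants


variants = question_variants("human", "TP53")
print(len(variants))
print(variants[0])
```

Each generated sentence should map to the same target query, so a set like this can serve as regression input for a natural-language-to-SPARQL converter.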

Ontology

IRI mapping and normalization

The first step in any semantic integration activity is to agree on the identifiers for various concepts. BioPortal, a central repository for biomedical ontologies, allows users to download original ontology files in a variety of formats (OWL [41], OBO [42], etc.), but also makes these ontologies available using RDF through a Web service and SPARQL endpoint [43]. In RDF, entities (classes, relations and individuals) are identified using an Internationalized Resource Identifier (IRI); however, the identifiers that are automatically generated by BioPortal do not always match with those used in submitted RDF-based ontologies, thereby impeding integration across ontologies. Moreover, since ontologies are also used to semantically annotate biomedical data, there is a lack of semantic integration between data and ontology. BioHackathon activities included surveying, mapping, and normalizing the IRIs present in the RDF-based ontologies found in the BioPortal SPARQL endpoint to a canonical set of IRIs in a custom dataset and namespace registry, primarily used by the Bio2RDF project [44]. This registry is being integrated with the MIRIAM Registry [45] which powers Identifiers.org, thereby enabling users to select either the provider IRI (if available), the Identifiers.org IRI (if available), or the Bio2RDF IRI (for all data and ontologies) [46].
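The choice between provider, Identifiers.org and Bio2RDF IRIs can be sketched as a simple preference order. The function names and the `in_registry` flag are hypothetical, and the two IRI patterns shown are the commonly used forms of these services as understood here, not a statement of the registry's implementation.

```python
def candidate_iris(namespace, local_id, provider_template=None, in_registry=True):
    """Build the IRIs under which one record may be known.

    provider_template: optional pattern with an {id} placeholder, present only
        when the data provider publishes its own resolvable IRIs.
    in_registry: whether the namespace is assumed to be registered with
        Identifiers.org (hypothetical flag for this sketch).
    """
    iris = {}
    if provider_template:
        iris["provider"] = provider_template.format(id=local_id)
    if in_registry:
        iris["identifiers.org"] = f"http://identifiers.org/{namespace}/{local_id}"
    # A Bio2RDF-style IRI can always be formed from namespace and identifier.
    iris["bio2rdf"] = f"http://bio2rdf.org/{namespace}:{local_id}"
    return iris


def preferred_iri(iris, order=("provider", "identifiers.org", "bio2rdf")):
    """Return the first available IRI according to the caller's preference."""
    for kind in order:
        if kind in iris:
            return iris[kind]
    raise KeyError("no IRI available")


p53 = candidate_iris(
    "uniprot", "P04637", provider_template="http://purl.uniprot.org/uniprot/{id}"
)
print(preferred_iri(p53))
print(preferred_iri(p53, order=("identifiers.org", "bio2rdf")))
```

Whatever order a consumer prefers, integration only works if all three forms are recorded as referring to the same entity, which is what the shared registry is meant to provide.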

Environmental ontologies for metagenomics

In the domain of metagenomics, establishing a semantically controlled description of a sample’s original environment is essential for reliably archiving and retrieving relevant datasets. The BioHackathon resulted in a strategy for the re-engineering of the Metagenome Environment Ontology (MEO) [47], closely linked to the MicrobeDB project [48], to serve as a community-specific portal to resources such as the Environment Ontology (EnvO) [49]. In this role, MEO will deliver curated, high-value subsets of such resources to the (meta)genomics community for use in efficient, semantically controlled annotation of sample environments. Additionally, MEO will enrich and shape the ontologies and vocabularies it references through persistently consolidating and submitting feedback from its users.

An ontology for lexical resources

The Life Science Dictionary (LSD) [50] consists of various lexical resources including English-Japanese/Japanese-English dictionaries with >230,000 terms, a thesaurus using the MeSH vocabulary [51, 52], and co-occurrence data that show how often a pair of terms appears in a MEDLINE [53] entry. LSD has been edited and maintained by the LSD project since 1993 and provides a search service on the Web, as well as a downloadable version. To improve the machine-readability of this important lexical resource, the group developed an ontology for this dataset [54], and an RDF serialization of the LSD was designed and coded at the BioHackathon. As a result, a total of 5,600,000 triples were generated and made available at the SPARQL endpoint [55].

An ontology for incomplete enzyme reaction equations

Incomplete enzyme reactions are not of interest to the International Union of Biochemistry and Molecular Biology (IUBMB), which manages EC numbers [56], but are common in metabolomics. Enzymes and reactions are described in the Gene Ontology (GO) [57] and the Enzyme Mechanism Ontology (EMO) [58], but these simply follow the IUBMB classification. It would be helpful to establish a structured representation that describes the available knowledge about a reaction of interest even if its equation is not complete. A semantic representation of incomplete enzyme reaction equations was therefore designed based on ontological principles. About 6,800 complete reaction equations taken from the KEGG [59, 60] database were decomposed into 13,733 incomplete reactions, from which 2,748 chemical transformation patterns were obtained. They were classified into a semantic data structure consisting of about 1,100 terms (functional groups, substructures, and reaction types) commonly used in organic chemistry and biochemistry. We continue to curate the ontology for incomplete enzyme reaction equations, aiming at its use in metabolome and other omics-level research (available at GenomeNet [61]).

Metadata

Metadata activities at the BioHackathon could be grouped into three areas of focus: service quality indicators, database content descriptors, and a broader inclusive discussion of generic metadata that could be used to characterize datasets in a database catalogue for enhanced data discovery, assessment, and access (not limited to but still useful for biodatabases).

Service quality indicators

With respect to data quality, the BioHackers coined the phrase “Yummy Data” as a shorthand way of expressing not only data quality, but more importantly, the ability to explicitly determine the quality of a given dataset. While quality of the published data is an important issue, it is a domain that depends as much on the underlying biological experiments as the code that analyses them. As such, the data quality working group at the BioHackathon focused on the issue of testing the quality of the published data endpoint, with respect to endpoint availability and other metrics. To this end, the Yummy Data project [62] was initiated; it periodically inspects the availability, response time, content amount and a few quality metrics for a selection of SPARQL endpoints of interest to biomedical investigators. While it neither defines nor executes an exhaustive set of useful quality measurements, it is hoped that this software may act as a starting point that encourages others to measure the “yumminess” of the data they provide, and thereby improve the quality of the published semantic resources for the global community.
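An availability and response-time probe of the kind described above could look like the following sketch. It is not the Yummy Data implementation: the function names, the choice of a trivial `ASK` query as the probe, and the reported fields are assumptions. The HTTP call is passed in as a parameter so the timing logic can be exercised without network access.

```python
import time
import urllib.parse
import urllib.request

ASK_QUERY = "ASK { ?s ?p ?o }"  # cheap probe: does the store hold any triple?


def http_fetch(url, timeout=10):
    """Send a GET request and return (status code, response body)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


def check_endpoint(endpoint, fetch=http_fetch, clock=time.monotonic):
    """Probe one SPARQL endpoint and report availability and response time."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": ASK_QUERY})
    started = clock()
    try:
        status, body = fetch(url)
        alive = status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        alive, body = False, b""
    return {
        "endpoint": endpoint,
        "alive": alive,
        "seconds": clock() - started,
        "bytes": len(body),
    }


# Offline demonstration with a stand-in for the network call.
print(check_endpoint("http://example.org/sparql", fetch=lambda url: (200, b"{}")))
```

Run on a schedule against a list of endpoints, records like these accumulate into the availability and latency history that such a monitoring service reports.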

Database content descriptors

The BioDBCore project [63, 64] has created a community-defined, uniform, generic description of the core attributes of biological databases that will allow potential users of a database to determine its suitability for their task at hand (e.g. taxonomic range, update frequency, etc.). The proposed BioDBCore core descriptors are overseen by the International Society for Biocuration (ISB) [65], in collaboration with the BioSharing initiative [66]. One of the key activities of the BioDBCore discussion at the BioHackathon was to define the RDF Schema and relevant annotation vocabularies and ontologies capable of representing the nature of biological data resources. As mentioned above, RDF representations necessitate the choice of a stable URI for each resource. The persistent identifiers considered for biological databases included the NAR database collection [67, 68], DBpedia [69, 70], Identifiers.org and ORCID [71], while vocabularies from Biositemaps [72], EMBRACE Data and Methods (EDAM) [73], Biomedical Resource Ontology (BRO) [74] and The Ontology for Biomedical Investigations (OBI) [75] were evaluated to describe features such as resource and data types, and area-of-research. The exploration involved several specific use cases, including METI Life science integrated database portal (MEDALS) [76] and NBDC/DBCLS [77]. Another key activity at the hackathons focused on the BioDBCore Web interface [78], both for submission and retrieval. Open issues include how to specify the useful interconnectivity between databases, for example, in planning cross-resource queries, and how to describe the content of biological resources in a machine-readable way so that it can be easily queried by SPARQL regardless of the vocabularies a given resource uses. Currently, the group is considering the idea of using the named graph of a resource to store these kinds of metadata.
There was also inter-group discussion of how to integrate BioDBCore with other projects such as DRCAT [79], which defines a similar, overlapping set of biological resources and their features.

Generic metadata for dataset description

The generic metadata discussion started by defining the problem of making database catalogue metadata machine-readable, so that a given dataset is automatically discoverable and accessible by machine agents using SPARQL. We discussed a set of conventions to describe the nature and availability of datasets on the emerging life science Semantic Web. In addition to basic descriptions, we focused our effort on elements of origin, licensing, (re-)distribution, update frequency, data formats and availability, language, vocabulary and content summaries. We expect that adherence to a small number of simple conventions will not only facilitate discovery of independently generated and published data, but also create the basis for the emergence of a data marketplace, a competitive environment to offer redundant access to ever higher quality data. These discussions have continued in teleconferences hosted by the W3C Health Care and Life Sciences Interest Group (HCLSIG) [80], and included at various times stakeholders such as DBCLS, MEDALS, BioDBCore, Biological Linked Open Data (BioLOD) [81], Biositemaps, UniProt, Bio2RDF, Biogateway [82], Open PHACTS [83], EURECA [84] and Identifiers.org.
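A machine-readable dataset description covering several of the elements listed above (origin, licensing, language, vocabulary, availability and a content summary) might be generated as in the sketch below. The conventions actually agreed by the group are not given here, so the use of VoID and Dublin Core terms is an assumption, as are the function name and all example.org IRIs.

```python
PREFIXES = (
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
)


def describe_dataset(iri, title, license_iri, source_iri, endpoint, vocabularies, triples):
    """Return a Turtle description of one dataset (hypothetical helper).

    The title is assumed to contain no characters that need escaping.
    """
    lines = [
        f"<{iri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        '    dcterms:language "en" ;',
        f"    dcterms:license <{license_iri}> ;",        # licensing
        f"    dcterms:source <{source_iri}> ;",          # origin
        f"    void:sparqlEndpoint <{endpoint}> ;",       # availability
    ]
    lines += [f"    void:vocabulary <{v}> ;" for v in vocabularies]
    lines.append(f"    void:triples {triples} .")        # content summary
    return PREFIXES + "\n" + "\n".join(lines)


print(
    describe_dataset(
        iri="http://example.org/dataset/demo",
        title="Demo dataset",
        license_iri="http://creativecommons.org/licenses/by/4.0/",
        source_iri="http://example.org/original-database",
        endpoint="http://example.org/sparql",
        vocabularies=["http://biohackathon.org/resource/faldo#"],
        triples=1200,
    )
)
```

Once such descriptions are loaded into a catalogue, a machine agent can select datasets by license, vocabulary or size with an ordinary SPARQL query instead of reading each provider's documentation.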

Platforms

RDFization tools

Generation of RDF data often requires iterative trials. In an early stage of prototyping RDF data, it is recommended to use OpenRefine [85] (formerly known as Google Refine) with the RDF extension [86] for correcting inconsistencies in the data, generating URIs from ID literals and eventually converting tabular data into RDF. To automate the procedure, various hackathon initiatives generated RDFization tools and libraries, particularly for the Bio* projects. A generic tool, bio-table [87], can be used for converting tabular data into RDF, using powerful filters and overrides. This command-line tool is freely available as a biogem package and was expanded during the BioHackathon to include support for named columns. Another Ruby biogem binary and library called bio-rdf [88] utilizes bio-table and generates RDF data from the results of genomic analysis including gene enrichment, QTL and other protocols implemented in R/Bioconductor. BioInterchange was conceived and designed during BioHackathon 2012 as a tool, with web services and libraries for the Ruby, Python and Java languages, to create RDF triples from files in TSV, XML, GFF3, GVF and other formats including text mining results. Users can specify external ontologies for the conversion, and the project also developed the biomedical ontologies needed for GFF3 and GVF data [89]. ONTO-PERL [90], a tool to handle ontologies represented in the OBO format, was extended to allow conversion of Gene Ontology (GO) annotations into RDF (GOA2RDF). Moreover, given that most legacy data resources have a corresponding XML schema, some effort was put into exploring and coding automated Schema-to-RDF translation tools for many of the widely used bioinformatics data formats such as BioXSD [91].
After working with the EDAM developers at the BioHackathon to modify their URI format to fit more naturally with an RDF representation, the EDAM ontology was successfully used to annotate the relevant portions of an automated BioXSD transformation, suggesting that significantly greater interoperability between bioinformatics resources should soon be enabled.

Triple stores

Moving from individual endpoints to multi-resource integration, the BioHackathon working group on triple stores also explored the problem of deploying multiple, interdependent and distributed triple stores, as well as searching over these, which included the examination of cluster-based triple stores, Hadoop-based triple stores [92–94], and emergent federated search systems. The group determined that Hadoop-based stores were not mature enough for production use because they worked with only limited types of data and lacked functionality such as a SPARQL endpoint and a user interface. Regarding cluster-based triple stores, the group found that there was insufficient documentation regarding installation, so these could not be tested sufficiently. Federated search using SPARQL 1.1 [95] could only be tested on OWLIM [96] at the time, and it was found that queries could not work efficiently across multiple endpoints. Thus, while single-source semantic publication seems to be well supported, the technologies backing distributed semantic datasets - both from the publisher’s and the consumer’s perspective - are lacking at this time.

Applications

Semantic Web exploration and visualization

The Semantic Web simplifies the integration of heterogeneous information without the need for a pre-coordinated comprehensive schema. As a trade-off, querying Semantic Web resources poses particular challenges: how can a researcher understand what is in a knowledge base, and how can he or she understand its information structure enough to make effective queries? Interactive exploration and visualization tools offer intuitive approaches to information discovery and can help applied researchers to effectively make use of Semantic Web resources. In the previous edition of the BioHackathon, a working group focused on the development of prototypes to visualize RDF knowledge bases. As Semantic Web and Linked Data resources are becoming more available, in the life sciences and beyond, several new tools (interactive or not) for visualization of these kinds of resources have been proposed. At the 2011 edition of the BioHackathon, we compiled a review of such tools, with a view to their applicability in the biomedical domain. Through inspections and surveys we gathered basic information on more than 30 currently available tools. In particular, we gathered information on:

Requirements and availability The operating systems supported, hardware requirements, licensing and costs. Relevant to an applied biomedical domain, we have also considered the availability of simplified install procedures.

Features The type of data access supported (e.g., via SPARQL endpoint or file-based), the type of query formulation supported (creation of graphic patterns, text-based queries, boolean queries), and whether any reasoning services are provided or exploited. Finally, when possible we have recorded some indication of the type of user interaction proposed (e.g., browsing versus link discovery).

Assistance and support Whenever possible, we have collected information on the availability of community-based or commercial support, the availability of documentation, the frequency of software updates and the availability of user groups and mailing lists, for which we have sketched approximate activity metrics.

Technical aspects Whether the observed tools can be embedded in other systems, or if they provide a plugin architecture. When relevant, in which language they are developed, and finally which standards they support (e.g., VoID [97], SPARQL 1.1).

Specificity to life sciences use cases Finally, we have tried to collect information highlighting the usability of these tools in life sciences research (e.g., life sciences bundled datasets, relevant demo cases, citations per research area).

This collection of information is useful for deciding which tools are potentially usable given technical, expertise or reliability constraints. Following this data collection exercise, we started to devise a classification of tools by identifying some defining key characteristics. For instance, a key characteristic of the surveyed tools is their approach to data: some focus more on instance data and tend to provide a graph-like metaphor, whereas others focus more on classes and relations and tend to present class-based access. Another key aspect is the degree to which visualization tools aim at supporting data exploration rather than explanation. Based on our classification, we aim to choose a few representative tools, benchmark them, and evaluate how effective different types of tools are in simple tasks.

Ontology mapping visualization

Ontology mapping deals with relating concepts from different ontologies and is typically concerned with the representation and storage of mappings between the concepts [98]. BioPortal ontologies [40] are usually interconnected, and mappings between them are available, although a visualization of these mappings is not currently available. Two types of mapping visualizations were explored at the BioHackathon: (1) a visualization of ontology mappings across all BioPortal ontologies, and (2) a visualization of a subset of BioPortal ontologies that would be useful in OntoFinder/Factory [99], a tool for finding relevant BioPortal ontologies and for building new ontologies. The hackers investigated the applicability and utility of two tools/environments: Google Fusion Tables [100] and Gephi [101]. This work is ongoing.
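
One possible way to prepare such mappings for a graph tool is to collapse concept-level mappings into weighted ontology-level edges. The following Python sketch is illustrative only and is not the code written at the BioHackathon: the mapping pairs are made-up sample data rather than real BioPortal mappings, and the function names are hypothetical. It writes an edge table with Source, Target and Weight columns, a layout that Gephi's spreadsheet import is generally able to read.

```python
import csv
import io
from collections import Counter


def ontology_edges(concept_mappings):
    """Collapse concept-level mappings into ontology-level edge weights."""
    weights = Counter()
    for (src_onto, _src_term), (dst_onto, _dst_term) in concept_mappings:
        if src_onto == dst_onto:
            continue  # ignore mappings inside a single ontology
        # Sort the pair so that A-B and B-A count as the same undirected edge.
        edge = tuple(sorted((src_onto, dst_onto)))
        weights[edge] += 1
    return weights


def to_edge_csv(weights):
    """Serialize the weighted edges as a CSV edge table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), w in sorted(weights.items()):
        writer.writerow([a, b, w])
    return buf.getvalue()


# Made-up sample mappings: ((ontology, term), (ontology, term)).
sample = [
    (("GO", "term1"), ("MeSH", "termA")),
    (("MeSH", "termB"), ("GO", "term2")),
    (("GO", "term3"), ("EnvO", "termX")),
    (("GO", "term4"), ("GO", "term5")),
]
edges = ontology_edges(sample)
edge_csv = to_edge_csv(edges)
print(edge_csv)
```

Aggregating to the ontology level keeps the graph small enough to lay out; the edge weight then conveys how densely two ontologies are mapped to each other.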

Identifier conversion service

The existence of multiple synonyms for the same data (sets) often inhibits cross-resource querying and data mining. Thus, a centralized server containing curated links between and among life-science databases would greatly facilitate data integration tasks in bioinformatics. The members of the G-language [102] group began developing an identifier conversion Web service named G-Links. Based on the cross-referencing information available from UniProt and KEGG, this RESTful service retrieves all identifiers, and their corresponding PURLs, related to an identifier provided by the user. In addition, users may supply nucleotide or amino acid sequences in place of the identifier for rapid annotation of sequences. To comply with the recent Semantic Web and Linked Data initiatives, results can be returned in N-Triples or RDF/XML formats for interoperability, as well as in the legacy GenBank, EMBL and tabular formats (Table 2). This service is freely available at http://link.g-language.org/.
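
To illustrate how a client might consume N-Triples output from an identifier conversion service, the sketch below groups cross-references by subject. It is a minimal example under stated assumptions: the sample response, its example.org URIs and the predicate names are placeholders invented for illustration and do not reflect the actual G-Links output, and the regular expression handles only the simplest N-Triples lines (IRI or plain-literal objects), not the full grammar.

```python
import re
from collections import defaultdict

# Simplified pattern: <s> <p> <o> .   or   <s> <p> "literal" .
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)


def parse_ntriples(text):
    """Return (subject, predicate, object) tuples for the lines we can parse."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        m = TRIPLE.match(line)
        if m is None:
            continue  # outside this simplified grammar (e.g. typed literals)
        subj, pred, iri_obj, lit_obj = m.groups()
        triples.append((subj, pred, iri_obj if iri_obj is not None else lit_obj))
    return triples


def links_by_subject(triples):
    """Group (predicate, object) pairs under each subject."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        grouped[s].append((p, o))
    return dict(grouped)


# Hypothetical response; every URI below is a placeholder.
SAMPLE = """\
# placeholder data for illustration
<http://example.org/gene/g1> <http://example.org/vocab/seeAlso> <http://example.org/protein/p1> .
<http://example.org/gene/g1> <http://example.org/vocab/label> "example gene" .
"""

triples = parse_ntriples(SAMPLE)
links = links_by_subject(triples)
for subject, pairs in links.items():
    print(subject, pairs)
```

For production use, a full RDF library would be a more robust choice than a hand-written pattern.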

Table 2 Example queries using G-Links


One of the central advantages of Linked Data for an end-user biologist is the ease of discovery and retrieval of related information. On the other hand, biological data is highly inter-related, and the multitude of linkages can easily become overwhelming, resulting in the familiar “hair balls” frequently seen in protein-interaction networks. Sophisticated filtering of Linked Data result sets, by ranking the results according to relevance to one’s interests or by some form of enrichment of interesting phenomena, would assist greatly in interpreting the content of semantic data stores. Such filtering, or data arrangement and presentation, should ideally be accompanied by an intuitive visualization. Participants pursued these goals by first generating a complete genome (gene set) of Escherichia coli as Linked Data using G-Links, together with several associated numerical datasets calculated through the G-language REST Web service [103] (a product of BioHackathon 2009). Statistics such as Cramér’s V for nominal data and Spearman’s rank correlation for continuous data were applied to data coming from multiple, overlapping sources (e.g., KEGG versus Reactome [104] versus BioCyc [105] for pathways) to cluster result sets according to their similarity. This would allow, for example, a user to choose the least-redundant subset of results in order to maximize the amount of unique information passed to a visualization tool. Conversely, these metrics can be used to screen for enrichment, where over-representation of the same dataset is considered meaningful and that dataset should therefore be highlighted. An example of both types of filtering was created by the participants using the JavaScript InfoVis Toolkit [106]. The resulting graph is highly interactive: any node representing a data set can be clicked to re-center the graph layout on that data set, with animation. Demonstrations using pre-calculated E. coli data are available [107, 108].
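
The two statistics named above can be computed with the standard library alone. The sketch below is a plain textbook implementation, not the participants' code: Cramér's V is derived from the chi-squared statistic of a contingency table, and Spearman's rank correlation is computed as the Pearson correlation of average ranks. The input lists are toy data chosen for illustration, and constant inputs (zero variance) are not handled.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two paired lists of nominal labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # only one category on one side: no association measurable
    return math.sqrt(chi2 / (n * k))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman_rho(xs, ys):
    """Spearman's rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: two sources that agree perfectly on a nominal assignment,
# and two numeric series that are monotonically related.
v_same = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
v_indep = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
rho_up = spearman_rho([1, 2, 3, 4], [1, 4, 9, 16])
rho_down = spearman_rho([1, 2, 3, 4], [40, 30, 20, 10])
print(v_same, v_indep, rho_up, rho_down)
```

Values near 1 (or near -1 for Spearman) indicate that two sources carry largely redundant information, while values near 0 suggest that each contributes something the other does not.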

Natural language semantic query via voice recognition

Finally, the project that generated the most “buzz” among the participants in BioHackathon 2012 was Genie, a “Siri [109] for Biologists”. The G-language Project members undertook the development of a virtual research assistant for bioinformatics, designed to be an intuitive entry-level gateway for database searches. The prototype developed and demonstrated at the BioHackathon was limited to gene- and genome-centric questions. Users communicate with Genie using spoken English, and Genie replies in a synthesized voice. Genie can find information in three main categories:

1. Anything about a gene of interest, such as its sequence, function, cellular localization, pathway, related diseases, related SNPs and polymorphisms, interactions, regulation and expression levels.
2. Anything about a set of genes, based on multiple criteria. For example, all SNPs in genes that are related to cancer, that work as transferases, that are expressed in the cytoplasm, and that have orthologs in mice.
3. Anything about a genome, such as production of different types of visual maps, calculation of GC skews, prediction of the origin and terminus of replication, calculation of codon usage bias, and so on.

Genie uses an NLP and dictionary-based approach: the species name serves as a top-level filter to reduce the search/retrieval space, annotations are fetched for that species, and a dictionary of gene names is created dynamically. In order to implement integrated information retrieval, the following software systems were used:

The G-language Genome Analysis Environment and its REST service, which allow extremely rapid genome-centric information retrieval.
G-language Maps (Genome Projector and Pathway Projector, as well as the Chaos Game Representation REST Service), which visualize that genomic information.
The Keio Bioinformatics Web Services EMBASSY package and EMBOSS [110], which provide more than 400 tools that can be applied to the information.
G-Links, an extremely rapid gene-centric data aggregator.

The Genie prototype is accessible online [111, 112].
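
The species-first, dictionary-based lookup described for Genie might be sketched as follows. This is a hypothetical illustration rather than Genie's implementation: the in-memory `ANNOTATIONS` store, its placeholder annotation text and all function names are assumptions, and a real system would fetch annotations from a service and use more robust language processing than simple tokenization.

```python
import re

# Hypothetical annotation store: species -> gene name -> annotation fields.
ANNOTATIONS = {
    "escherichia coli": {
        "recA": {"function": "placeholder annotation text"},
        "lacZ": {"function": "placeholder annotation text"},
    },
    "bacillus subtilis": {
        "recA": {"function": "placeholder annotation text"},
    },
}


def detect_species(question):
    """Top-level filter: find which known species the question mentions."""
    lowered = question.lower()
    for species in ANNOTATIONS:
        if species in lowered:
            return species
    return None


def build_gene_dictionary(species):
    """Build the gene-name dictionary for one species, per query."""
    # Lower-cased keys give case-insensitive matching; values keep the
    # canonical spelling of the gene name.
    return {name.lower(): name for name in ANNOTATIONS[species]}


def find_genes(question):
    """Return (species, gene names mentioned) for a plain-English question."""
    species = detect_species(question)
    if species is None:
        return None, []
    gene_dict = build_gene_dictionary(species)
    tokens = re.findall(r"[A-Za-z0-9_-]+", question.lower())
    hits = [gene_dict[t] for t in tokens if t in gene_dict]
    return species, hits


result = find_genes("What is the function of recA in Escherichia coli?")
print(result)
```

Restricting the dictionary to one species keeps it small and avoids matching gene names that exist only in other organisms.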

Conclusions

The BioHackathon series started out with the Integrated Database Project of Japan, which aims to integrate all life science databases in Japan. Initially, the focus was on Web services and workflows to enable efficient data retrieval. However, the focus eventually shifted towards Semantic Web technologies due to the increasing heterogeneity and interlinked nature of the data at hand, for example, from the accumulation of next-generation sequencing data and their annotations. From this, the community recognized the importance of RDF and ontology development, fundamental Semantic Web technologies that have also come to gain the attention of other domains in the life sciences, including genome science, glycosciences and protein science. For example, BioMart and InterMine, which were initially developed to aid the integration of life science data, have now started to support Semantic Web technologies. These hackathons have served as a driving force towards the integration of data “islands” that have slowly started linking to one another through RDF development. However, insufficient guidelines, ontologies and tools to support RDF development have hampered true integration. The development of such guidelines, ontologies and tools has been the central focus of these hackathons, bringing together the community on a consistent basis, and these efforts have finally started to bud. We expect them to bear fruit in the near future through the development of biomedical and metagenome applications built on top of these developments. Moreover, we expect that text mining will become increasingly vital to enriching life science Semantic Web data with the knowledge currently hidden within the literature.

Abbreviations

ASCII:: American Standard Code for Information Interchange
BCSDB:: Bacterial Carbohydrate Structure Database
BRO:: Biomedical Resource Ontology
CAI:: Codon Adaptation Index
DBCLS:: Database Center for Life Science
DOI:: Digital Object Identifier
DRCAT:: Data Resource CATalogue
EDAM:: EMBRACE Data And Methods
EMBOSS:: European Molecular Biology Open Software Suite
EnvO:: Environment Ontology
EURECA:: Enabling information re-Use by linking clinical REsearch and Care
FALDO:: Feature Annotation Location Description Ontology
FOP:: Frequency of OPtimal codons
GAE:: Genome Analysis Environment
GFF3:: Generic Feature Format version 3
GFF2OWL:: Generic Feature Format to Web Ontology Language
GO:: Gene Ontology
GOA2RDF:: Gene Ontology Annotations to RDF
INSDC:: International Nucleotide Sequence Database Collaboration
IRI:: Internationalized Resource Identifier
ISB:: International Society for Biocuration
IUBMB:: International Union of Biochemistry and Molecular Biology
JCGGDB:: Japan Consortium for Glycobiology and Glycotechnology Database
KEGG:: Kyoto Encyclopedia of Genes and Genomes
LSD:: Life Science Dictionary
MEDALS:: METI DAtabase portal for Life Science
MEO:: Metagenome Environment Ontology
MeSH:: Medical Subject Headings
MIRIAM:: Minimal Information Required In the Annotation of Models
NBDC:: National Bioscience Database Center
NLP:: Natural Language Processing
NCBO:: National Center for Biomedical Ontology
OBI:: Ontology for Biomedical Investigations
OBO:: Open Biomedical Ontology
Open PHACTS:: Open Pharmaceutical Triple Store
OWL:: Web Ontology Language
PDF:: Portable Document Format
PHX:: Predicted Highly eXpressed genes
PURL:: Permanent URL
RDF:: Resource Description Framework
REST:: REpresentational State Transfer
RINGS:: Resource for INformatics of Glycomes at Soka
SIO:: Semanticscience Integrated Ontology
SNPs:: Single Nucleotide Polymorphisms
SO:: Sequence Ontology
SPARQL:: SPARQL Protocol and RDF Query Language
UUID:: Universally Unique Identifier
XML:: eXtensible Markup Language.

References

Katayama T, Arakawa K, Nakao M: The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. J Biomed Semantics. 2010, 1: 8-10.1186/2041-1480-1-8.
Katayama T, Wilkinson MD, Vos R: The 2nd DBCLS BioHackathon: interoperable bioinformatics Web services for integrated applications. J Biomed Semantics. 2011, 2: 4-10.1186/2041-1480-2-4.
Katayama T, Wilkinson MD, Micklem G: The 3rd DBCLS BioHackathon: improving life science data integration with semantic Web technologies. J Biomed Semantics. 2013, 4: 6-10.1186/2041-1480-4-6.
NBDC. http://biosciencedbc.jp/en/
DBCLS. http://dbcls.rois.ac.jp/en/
BioHackathon 2011. http://2011.biohackathon.org/
BioHackathon 2012. http://2012.biohackathon.org/
Nakamura Y, Cochrane G, Karsch-Mizrachi I: The international nucleotide sequence database collaboration. Nucleic Acids Res. 2013, 41: D21-D24. 10.1093/nar/gks1084.
GFF3. http://www.sequenceontology.org/gff3.shtml
Reese MG, Moore B, Batchelor C: A standard variation file format for human genome sequences. Genome Biol. 2010, 11: R88-10.1186/gb-2010-11-8-r88.
Takeda J-I, Yamasaki C, Murakami K: H-InvDB in 2013: an omics study platform for human functional gene and transcript discovery. Nucleic Acids Res. 2013, 41: D915-D919. 10.1093/nar/gks1245.
FALDO. http://biohackathon.org/resource/faldo
Bolleman J, Mungall CJ, Strozzi F: FALDO: A semantic standard for describing the location of nucleotide and protein feature annotation. bioRxiv. doi:10.1101/002121
Consortium UP: Reorganizing the protein space at the universal protein resource (UniProt). Nucleic Acids Res. 2012, 40: D71-D75.
Ogasawara O, Mashima J, Kodama Y: DDBJ new system and service refactoring. Nucleic Acids Res. 2013, 41: D25-D29. 10.1093/nar/gks1152.
BioHackathon/HMMER3 to RDF. https://github.com/dbcls/bh11/wiki/Hmmer3-rdf-xml
BioHackathon/INSDC to RDF. https://github.com/dbcls/bh11/wiki/OpenBio
BioHackathon/GTF to RDF. https://github.com/dbcls/bh12/wiki/Cufflinks-rdf
BioHackathon/GFF3 to OWL. https://code.google.com/p/gff3-to-owl/source/browse/trunk/GFF2OWL.groovy
Aranda B, Blankenburg H, Kerrien S: PSICQUIC and PSISCORE: accessing and scoring molecular interactions. Nat Methods. 2011, 8: 528-529. 10.1038/nmeth.1637.
Demir E, Cary MP, Paley S: The BioPAX community standard for pathway data sharing. Nat Biotechnol. 2010, 28: 935-942. 10.1038/nbt.1666.
Toukach PV: Bacterial carbohydrate structure database 3: principles and realization. J Chem Inf Model. 2011, 51: 159-170. 10.1021/ci100150d.
Ranzinger R, Frank M, von der Lieth C-W, Herget S: Glycome-DB.org: a portal for querying across the digital world of carbohydrate sequences. Glycobiology. 2009, 19: 1563-1567. 10.1093/glycob/cwp137.
Ranzinger R, Herget S, von der Lieth C-W, Frank M: GlycomeDB–a unified database for carbohydrate structures. Nucleic Acids Res. 2011, 39: D373-D376. 10.1093/nar/gkq1014.
Lütteke T, Bohne-Lang A, Loss A: GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research. Glycobiology. 2006, 16: 71R-81R. 10.1093/glycob/cwj049.
JCGGDB. http://jcggdb.jp/index_en.html
MonosaccharideDB. http://www.monosaccharidedb.org/
Akune Y, Hosoda M, Kaiya S, Shinmachi D, Aoki-Kinoshita KF: The RINGS resource for glycome informatics analysis and data mining on the Web. Omics. 2010, 14: 475-486. 10.1089/omi.2009.0129.
Campbell MP, Hayes CA, Struwe WB: UniCarbKB: putting the pieces together for glycomics research. Proteomics. 2011, 11: 4117-4121. 10.1002/pmic.201100302.
Aoki-Kinoshita KF, Bolleman J, Campbell MP: Introducing glycomics data into the Semantic Web. J Biomed Semantics. 2013, 4: 39-10.1186/2041-1480-4-39.
Constantin A, Pettifer S, Voronkov A: PDFX: Fully-automated PDF-to-XML Conversion of Scientific Literature. Proceedings of the 13th ACM symposium on Document Engineering: 10-13 September 2013; Florence, Italy. 2013, 177-180.
PDFX. http://pdfx.cs.man.ac.uk/
Iwasaki W, Yamamoto Y, Takagi T: TogoDoc server/client system: smart recommendation and efficient management of life science literature. PLoS One. 2010, 5: e15305-10.1371/journal.pone.0015305.
Hakenberg J, Gerner M, Haeussler M: The GNAT library for local and remote gene mention normalization. Bioinformatics. 2011, 27: 2769-2771. 10.1093/bioinformatics/btr455.
Huang M, Liu J, Zhu X: GeneTUKit: a software for document-level gene normalization. Bioinformatics. 2011, 27: 1032-1033. 10.1093/bioinformatics/btr042.
Leaman R, Gonzalez G: BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput. 2008, 13: 652-663.
SIO. http://semanticscience.org/
BioInterchange. http://www.biointerchange.org/
Stearns MQ, Price C, Spackman KA, Wang AY: SNOMED clinical terms: overview of the development process and project status. Proceedings of AMIA Symposium: 3-7 November 2001; Washington, DC. 2001, 662-666.
Whetzel PL, Noy NF, Shah NH: BioPortal: enhanced functionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic Acids Res. 2011, 39: W541-W545. 10.1093/nar/gkr469.
OWL. http://www.w3.org/TR/owl2-overview/
OBO. http://oboformat.googlecode.com/svn/branches/2011-11-29/doc/obo-syntax.html
BioPortal SPARQL endpoint. http://sparql.bioontology.org/
Callahan A, Cruz-Toledo J, Dumontier M: Ontology-based querying with Bio2RDF’s linked open data. J Biomed Semantics. 2012, 4: S1-
Juty N, Novère NL, Laibe C: Identifiers.org and MIRIAM registry: community resources to provide persistent identification. Nucleic Acids Res. 2012, 40: D580-D586. 10.1093/nar/gkr1097.
Juty N, Le NN, Hermjakob H, Laibe C: Towards the Collaborative Curation of the Registry underlying identifiers.org. Database. 2013, 2013: bat017-
MEO. http://mdb.bio.titech.ac.jp/meo/about_meo
MicrobeDB. http://microbedb.jp/
EnvO. http://environmentontology.org/
LSD. http://lsd.pharm.kyoto-u.ac.jp/en/
Rogers FB: Medical subject headings. Bull Med Libr Assoc. 1963, 51: 114-116.
MeSH. http://www.ncbi.nlm.nih.gov/mesh
MEDLINE. http://www.nlm.nih.gov/pubs/factsheets/medline.html
LSD ontology. http://purl.jp/bio/10/lsd/ontology/201209
LSD SPARQL endpoint. http://purl.jp/bio/10/lsd/sparql
McDonald AG, Boyce S, Tipton KF: ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res. 2009, 37: D593-D597. 10.1093/nar/gkn582.
Ashburner M, Ball CA, Blake JA: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
EMO. http://bioportal.bioontology.org/ontologies/EMO
Muto A, Kotera M, Tokimatsu T: Modular architecture of metabolic pathways revealed by conserved sequences of reactions. J Chem Inf Model. 2013, 53: 613-622. 10.1021/ci3005379.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40: D109-D114. 10.1093/nar/gkr988.
GenomeNet. http://www.genome.jp/
YummyData. http://yummydata.org/
Gaudet P, Bairoch A, Field D: Towards BioDBcore: a community-defined information specification for biological databases. Nucleic Acids Res. 2011, 39: D7-D10. 10.1093/nar/gkq1173.
Gaudet P, Bairoch A, Field D: Towards BioDBcore: a community-defined information specification for biological databases. Database. 2011, 2011: baq027-
ISB. http://biocurator.org/
Baker NA, Klemm JD, Harper SL: Standardizing data. Nat Nanotechnol. 2013, 8: 73-74.
Fernández-Suárez XM, Galperin MY: The 2013 nucleic acids research database issue and the online molecular biology database collection. Nucleic Acids Res. 2013, 41: D1-D7. 10.1093/nar/gks1297.
NAR Database summary paper. http://www.oxfordjournals.org/nar/database/cap/
DBpedia. http://dbpedia.org/
Yamamoto Y, Yamaguchi A, Yonezawa A: Building linked open data towards integration of biomedical scientific literature with DBpedia. J Biomed Semantics. 2013, 4: 8-10.1186/2041-1480-4-8.
ORCID. http://orcid.org/
Biositemaps. http://biositemap.org/
Ison J, Kalas M, Jonassen I: EDAM: An ontology of bioinformatics operations, types of data and identifiers, topics, and formats. Bioinformatics. 2013, 29: 1325-1332. 10.1093/bioinformatics/btt113.
BRO. http://bioportal.bioontology.org/ontologies/BRO
OBI. http://obi-ontology.org/
MEDALS. http://medals.jp/etop
LSDB catalog. http://integbio.jp/dbcatalog/?lang=en
BioDBCore web interface. http://biosharing.org/biodbcore
DRCAT. http://drcat.sourceforge.net/
W3C HCLSIG. http://www.w3.org/wiki/HCLSIG
BioLOD. http://biolod.org/
Biogateway. http://www.semantic-systems-biology.org/biogateway
Open PHACTS. http://www.openphacts.org/
EURECA. http://eurecaproject.eu/
OpenRefine. http://openrefine.org/
OpenRefine RDF extension. http://refine.deri.ie/
Biogem/bio-table. http://rubygems.org/gems/bio-table
Biogem/bio-rdf. http://rubygems.org/gems/bio-rdf
GFF3 and GVF ontology. http://www.biointerchange.org/ontologies.html
Antezana E, Egaña M, De Baets B, Kuiper M, Mironov V: ONTO-PERL: an API for supporting the development and analysis of bio-ontologies. Bioinformatics. 2008, 24: 885-887. 10.1093/bioinformatics/btn042.
Kalaš M, Puntervoll P, Joseph A: BioXSD: the common data-exchange format for everyday bioinformatics web services. Bioinformatics. 2010, 26: i540-i546. 10.1093/bioinformatics/btq391.
SHARD Triple-Store. http://www.avometric.com/shard.shtml
HadoopRDF. http://code.google.com/p/hadooprdf/
WebPIE. http://www.few.vu.nl/~jui200/webpie.html
SPARQL 1.1. http://www.w3.org/TR/sparql11-query/
OWLIM. http://www.ontotext.com/owlim
VoID. http://www.w3.org/TR/void/
Granitzer M, Sabol V, Onn KW, Lukose D, Tochtermann K: Ontology alignment—a survey with focus on visually supported semi-automatic techniques. Future Internet. 2010, 2: 238-258. 10.3390/fi2030238.
OntoFinder/OntoFactory. http://ontofinder.dbcls.jp/
Google Fusion Tables. http://www.google.com/fusiontables/
Gephi. https://gephi.org/
Arakawa K, Mori K, Ikeda K: G-language genome analysis environment: a workbench for nucleotide sequence data mining. Bioinformatics. 2003, 19: 305-306. 10.1093/bioinformatics/19.2.305.
Arakawa K, Kido N, Oshita K, Tomita M: G-language genome analysis environment with REST and SOAP web service interfaces. Nucleic Acids Res. 2010, 38: W700-W705. 10.1093/nar/gkq315.
Croft D, O’Kelly G, Wu G: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011, 39: D691-D697. 10.1093/nar/gkq1018.
Caspi R, Altman T, Dreher K: The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 2012, 40: D742-D753. 10.1093/nar/gkr1014.
JavaScript InfoVis Toolkit. http://thejit.org/
G-Links demo 1. http://ws.g-language.org/toys/bh11/
G-Links demo 2. http://ws.g-language.org/toys/bh11/index2.html
Siri. https://www.apple.com/ios/siri/
Rice P, Longden I, Bleasby A: EMBOSS: the European molecular biology open software suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
Genie. http://ws.g-language.org/genie/
Genie video. http://www.youtube.com/watch?v=V4jsuIOAwyM


Acknowledgements

BioHackathon 2011 and 2012 were supported by the Integrated Database Project (Ministry of Education, Culture, Sports, Science and Technology of Japan) and hosted by the National Bioscience Database Center (NBDC) and the Database Center for Life Science (DBCLS).

Author information

Authors and Affiliations

Database Center for Life Science, Research Organization of Information and Systems, 2-11-16, Yayoi, Bunkyo-ku, Tokyo, 113-0032, Japan
Toshiaki Katayama, Shuichi Kawashima, Yasunori Yamamoto, Atsuko Yamaguchi, Shinobu Okamoto, Shin Kawano, Jin-Dong Kim, Yue Wang, Hongyan Wu, Hiromasa Ono, Hidemasa Bono, Simon Kocbek & Shoko Kawamoto
Centro de Biotecnología y Genómica de Plantas UPM-INIA (CBGP), Universidad Politécnica de Madrid, Campus Montegancedo, 28223 Pozuelo de Alarcón, Spain
Mark D Wilkinson
Department of Bioinformatics, Faculty of Engineering, Soka University, 1-236 Tangi-machi, Hachioji, Tokyo, 192-8577, Japan
Kiyoko F Aoki-Kinoshita, Yukie Akune, Takeo Katoda, Anna Kokubu & Takaaki Mori
National Institute of Informatics, JST PRESTO, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Yoshinobu Kano
Department of Electrical Engineering (ESAT/SCD), University of Leuven, Kasteelpark Arenberg 10, Leuven, 3001, Belgium
Jan Aerts
iMinds Future Health Department, University of Leuven, Kasteelpark Arenberg 10, Leuven, 3001, Belgium
Jan Aerts
Department of Biology, Norwegian University of Science and Technology (NTNU), Høgskoleringen 5, Trondheim, N-7491, Norway
Erick Antezana
Institute for Advanced Biosciences, Keio University, Endo 5322, Fujisawa, Kanagawa, 252-0882, Japan
Kazuharu Arakawa, Hidetoshi Itaya & Kazuki Oshita
Silicon Cat Ltd, 5 York Road, London, HA6 1JJ, UK
Bruno Aranda
Ontario Institute for Cancer Research, 101 College Street, Suite 800, Toronto, Ontario, M5G 0A3, Canada
Joachim Baran
SIB Swiss Institute of Bioinformatics, CMU, rue Michel Servet, Geneve, 4 1211, Switzerland
Jerven Bolleman
Integrative Biology Program, Istituto Nazionale Genetica Molecolare, Milan, 20122, Italy
Raoul JP Bonnal
The Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Am Handelshafen 12, Bremerhaven, 27570, Germany
Pier Luigi Buttigieg
Biomolecular Frontiers Research Centre, Macquarie University, North Ryde, NSW, 2109, Australia
Matthew P Campbell
National Institute of Biomedical Innovation, 7-6-8 Asagi Saito, Ibaraki-City, Osaka, 567-0085, Japan
Yi-an Chen, Yoshinobu Igarashi, Maori Ito, Johan Nystrom-Persson & Chisato Yamasaki
National Institute for Basic Biology, National Institutes of Natural Sciences, Nishigonaka 38, Myodaiji, Okazaki, Aichi, 444-8585, Japan
Hirokazu Chiba, Hiroyo Nishide & Ikuo Uchiyama
The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
Peter JA Cock
Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, 80045, USA
K Bretonnel Cohen
School of Computer Science, The University of Manchester, Oxford Road, M13 9PL, UK
Alexandru Constantin & Geraint Duck
Department of Biology, Institute of Biochemistry, School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, K1S 5B6, Canada
Michel Dumontier
Center for Information Biology, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka, 411-8540, Japan
Takatomo Fujisawa
INTEC Inc, 1-3-3 Shinsuna, Koto-ku, Tokyo, 136-8637, Japan
Toyofumi Fujiwara
Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka, 565-0871, Japan
Naohisa Goto
Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3EG, UK
Robert Hoehndorf
Atmosphere and Ocean Research Institute, the University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8564, Japan
Wataru Iwasaki
Computational Biology Unit, Uni Computing and Department of Informatics, University of Bergen, Thormøhlensgate 55, Bergen, 5008, Norway
Matúš Kalaš
Korea Institute of Science Technology and Information, 245 Daehangno, Yuseong, Daejeon, 305-806, Korea
Taehong Kim & Sung-Ho Shin
Department of Biotechnology, Bioinformation Engineering Laboratory, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan
Yusuke Komiyama
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 611-0011, Japan
Masaaki Kotera & Yosuke Nishimura
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Camille Laibe
National Evolutionary Synthesis Center (NESCent), 2024 W. Main St, Durham, NC, USA
Hilmar Lapp
Justus-Liebig-University Giessen, Institute of Veterinary Physiology and Biochemistry, Frankfurter Str. 100, Giessen, 35392, Germany
Thomas Lütteke
MAASTRO Clinic, Maastricht, Postbus 3035, Maastricht, 6202 NA, The Netherlands
M Scott Marshall
Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 4259 B-36, Nagatsuta-cho, Midori-ku, Yokohama, 226-8501, Japan
Hiroshi Mori
Center for Knowledge Structuring, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
Mizuki Morita
Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, Aomi 2-4-7, Koto-ku, Tokyo, 135-0064, Japan
Katsuhiko Murakami & Chisato Yamasaki
Next Generation Systems Core Function Unit, Eisai Product Creation Systems, Eisai Co., Ltd, 5-3-1 Toukoudai, Tsukuba, Ibaraki, 300-2635, Japan
Mitsuteru Nakao
Research Center for Medical Glycoscience, National Institute of Advanced Industrial Science and Technology (AIST), 1-1-1 Umezono, Tsukuba, Ibaraki, 305-8568, Japan
Hisashi Narimatsu & Hiromichi Sawaki
Department of Bioclinical informatics, Tohoku Medical Megabank Organization, Tohoku University, Seiryo-cho 4-1, Aoba-ku, Sendai-shi, Miyagi, 980-8575, Japan
Soichi Ogishima
Graduate School of Information Sciences (GSIS), Tohoku University, 6-3-09 Aoba, Aramaki-aza, Aoba-ku, Sendai, Miyagi, 980-8575, Japan
Yasunobu Okamura & Shu Tadaka
Niigata University Graduate School of Medical and Dental Sciences, 1-757 Asahimachi-dori, Chuo-ku, Niigata, 951-8510, Japan
Shujiro Okuda
Biomolecular Frontiers Research Centre, Macquarie University, North Ryde, NSW, 2109, Australia
Nicki H Packer
Laboratory of Nematology, Droevendaalsesteeg 1, Wageningen University, Wageningen, Netherlands
Pjotr Prins
Department of Biochemistry and Molecular Biology, The University of Georgia, 315 Riverbend Road, Athens, GA, 30602, USA
Rene Ranzinger & William S York
Oxford e-Research Center, University of Oxford, Oxford, OX1 3QG, UK
Philippe Rocca-Serra & Susanna Sansone
Digital Enterprise Research Institute, IDA Business Park, Lower Dangan, Galway, Ireland
Andrea Splendiani
intelliLeaf.com, Cambridge, UK
Andrea Splendiani
CeRSA, Parco Tecnologico Padano, Lodi, 26900, Italy
Francesco Strozzi
Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospekt 47, Moscow, 119991, Russia
Philip Toukach
Division of International Cooperative Research, Research Center for Ethnomedicine, Institute of Natural Medicine, University of Toyama, 2630 Sugitani, Toyama, 930-0194, Japan
Masahito Umezaki
Naturalis Biodiversity Center, Postbus 9517, Leiden, 2300 RA, the Netherlands
Rutger Vos
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, 94305-5479, USA
Patricia L Whetzel
Laboratory of Glyco-organic Chemistry, The Noguchi Institute, 1-8-1 Kaga, Itabashi-ku, Tokyo, 173-0003, Japan
Issaku Yamada
Tohoku Medical Megabank Organization, Tohoku University, Research Building No.3, 6-3-09 Aoba, Aramaki-aza, Aoba-ku, Sendai, Miyagi, 980-8575, Japan
Riu Yamashita
Program on Bioinformatics and Systems Biology, Sanford-Burnham Medical Research Institute, La Jolla, CA, 92037, USA
Christian M Zmasek
Department of Computational Biology, University of Tokyo, Kashiwa, Chiba, 277-8568, Japan
Toshihisa Takagi

Authors

Toshiaki Katayama
Mark D Wilkinson
Kiyoko F Aoki-Kinoshita
Shuichi Kawashima
Yasunori Yamamoto
Atsuko Yamaguchi
Shinobu Okamoto
Shin Kawano
Jin-Dong Kim
Yue Wang
Hongyan Wu
View author publications
You can also search for this author in PubMed Google Scholar
Yoshinobu Kano
View author publications
You can also search for this author in PubMed Google Scholar
Hiromasa Ono
View author publications
You can also search for this author in PubMed Google Scholar
Hidemasa Bono
View author publications
You can also search for this author in PubMed Google Scholar
Simon Kocbek
View author publications
You can also search for this author in PubMed Google Scholar
Jan Aerts
View author publications
You can also search for this author in PubMed Google Scholar
Yukie Akune
View author publications
You can also search for this author in PubMed Google Scholar
Erick Antezana
View author publications
You can also search for this author in PubMed Google Scholar
Kazuharu Arakawa
View author publications
You can also search for this author in PubMed Google Scholar
Bruno Aranda
View author publications
You can also search for this author in PubMed Google Scholar
Joachim Baran
View author publications
You can also search for this author in PubMed Google Scholar
Jerven Bolleman
View author publications
You can also search for this author in PubMed Google Scholar
Raoul JP Bonnal
View author publications
You can also search for this author in PubMed Google Scholar
Pier Luigi Buttigieg
View author publications
You can also search for this author in PubMed Google Scholar
Matthew P Campbell
View author publications
You can also search for this author in PubMed Google Scholar
Yi-an Chen
View author publications
You can also search for this author in PubMed Google Scholar
Hirokazu Chiba
View author publications
You can also search for this author in PubMed Google Scholar
Peter JA Cock
View author publications
You can also search for this author in PubMed Google Scholar
K Bretonnel Cohen
View author publications
You can also search for this author in PubMed Google Scholar
Alexandru Constantin
View author publications
You can also search for this author in PubMed Google Scholar
Geraint Duck
View author publications
You can also search for this author in PubMed Google Scholar
Michel Dumontier
View author publications
You can also search for this author in PubMed Google Scholar
Takatomo Fujisawa
View author publications
You can also search for this author in PubMed Google Scholar
Toyofumi Fujiwara
View author publications
You can also search for this author in PubMed Google Scholar
Naohisa Goto
View author publications
You can also search for this author in PubMed Google Scholar
Robert Hoehndorf
View author publications
You can also search for this author in PubMed Google Scholar
Yoshinobu Igarashi
View author publications
You can also search for this author in PubMed Google Scholar
Hidetoshi Itaya
View author publications
You can also search for this author in PubMed Google Scholar
Maori Ito
View author publications
You can also search for this author in PubMed Google Scholar
Wataru Iwasaki
View author publications
You can also search for this author in PubMed Google Scholar
Matúš Kalaš
View author publications
You can also search for this author in PubMed Google Scholar
Takeo Katoda
View author publications
You can also search for this author in PubMed Google Scholar
Taehong Kim
View author publications
You can also search for this author in PubMed Google Scholar
Anna Kokubu
View author publications
You can also search for this author in PubMed Google Scholar
Yusuke Komiyama
View author publications
You can also search for this author in PubMed Google Scholar
Masaaki Kotera
View author publications
You can also search for this author in PubMed Google Scholar
Camille Laibe
View author publications
You can also search for this author in PubMed Google Scholar
Hilmar Lapp
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Lütteke
View author publications
You can also search for this author in PubMed Google Scholar
M Scott Marshall
View author publications
You can also search for this author in PubMed Google Scholar
Takaaki Mori
View author publications
You can also search for this author in PubMed Google Scholar
Hiroshi Mori
View author publications
You can also search for this author in PubMed Google Scholar
Mizuki Morita
View author publications
You can also search for this author in PubMed Google Scholar
Katsuhiko Murakami
View author publications
You can also search for this author in PubMed Google Scholar
Mitsuteru Nakao
View author publications
You can also search for this author in PubMed Google Scholar
Hisashi Narimatsu
View author publications
You can also search for this author in PubMed Google Scholar
Hiroyo Nishide
View author publications
You can also search for this author in PubMed Google Scholar
Yosuke Nishimura
View author publications
You can also search for this author in PubMed Google Scholar
Johan Nystrom-Persson
View author publications
You can also search for this author in PubMed Google Scholar
Soichi Ogishima
View author publications
You can also search for this author in PubMed Google Scholar
Yasunobu Okamura
View author publications
You can also search for this author in PubMed Google Scholar
Shujiro Okuda
View author publications
You can also search for this author in PubMed Google Scholar
Kazuki Oshita
View author publications
You can also search for this author in PubMed Google Scholar
Nicki H Packer
View author publications
You can also search for this author in PubMed Google Scholar
Pjotr Prins
View author publications
You can also search for this author in PubMed Google Scholar
Rene Ranzinger
View author publications
You can also search for this author in PubMed Google Scholar
Philippe Rocca-Serra
View author publications
You can also search for this author in PubMed Google Scholar
Susanna Sansone
View author publications
You can also search for this author in PubMed Google Scholar
Hiromichi Sawaki
View author publications
You can also search for this author in PubMed Google Scholar
Sung-Ho Shin
View author publications
You can also search for this author in PubMed Google Scholar
Andrea Splendiani
View author publications
You can also search for this author in PubMed Google Scholar
Francesco Strozzi
View author publications
You can also search for this author in PubMed Google Scholar
Shu Tadaka
View author publications
You can also search for this author in PubMed Google Scholar
Philip Toukach
View author publications
You can also search for this author in PubMed Google Scholar
Ikuo Uchiyama
View author publications
You can also search for this author in PubMed Google Scholar
Masahito Umezaki
View author publications
You can also search for this author in PubMed Google Scholar
Rutger Vos
View author publications
You can also search for this author in PubMed Google Scholar
Patricia L Whetzel
View author publications
You can also search for this author in PubMed Google Scholar
Issaku Yamada
View author publications
You can also search for this author in PubMed Google Scholar
Chisato Yamasaki
View author publications
You can also search for this author in PubMed Google Scholar
Riu Yamashita
View author publications
You can also search for this author in PubMed Google Scholar
William S York
View author publications
You can also search for this author in PubMed Google Scholar
Christian M Zmasek
View author publications
You can also search for this author in PubMed Google Scholar
Shoko Kawamoto
View author publications
You can also search for this author in PubMed Google Scholar
Toshihisa Takagi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Toshiaki Katayama.

Additional information

Competing interests

The authors declare they have no competing interests.

Authors’ contributions

TK, MDW, and KFA primarily wrote the manuscript based on the group summaries written by participants. TK, SK, YY, AY, SO, SK2, JK, YW, HW, YK, HO, HB, SK3, SK4, TT organized BioHackathon 2011 and/or 2012. All authors except for JA and NP attended the BioHackathon 2011 and/or 2012. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Katayama, T., Wilkinson, M.D., Aoki-Kinoshita, K.F. et al. BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains. J Biomed Semant 5, 5 (2014). https://doi.org/10.1186/2041-1480-5-5

Received: 16 May 2013
Accepted: 26 November 2013
Published: 05 February 2014
DOI: https://doi.org/10.1186/2041-1480-5-5
