
Table 1 Summary of investigated issues and results covered during BioHackathons 2011 and 2012

From: BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains

RDF data


Domain specific models


Genome and proteome data


Issue: No standard RDF data model and tools existed for major genomic data


Result: Created FALDO, INSDC, GFF, GVF ontologies and developed converters


Software: Converters are now packaged in the BioInterchange tool; improved PSICQUIC service


Glycome data


Issue: Glycome and proteome databases are not effectively linked


Result: Developers of BCSDB, GlycomeDB, JCGGDB, MonosaccharideDB, RINGS, UniCarbKB and UniProt developed a standard RDF representation for carbohydrate structures


Software: RDFized data from these databases, stored them in Virtuoso and tested SPARQL queries among the different data resources


Text processing


Text extraction from PDF and metadata retrieval


Issue: Text for mining is often buried in the PDF formatted literature and requires preprocessing


Result: Incorporated a tool for text extraction combined with a metadata retrieval service for DOIs or PMIDs


Software: Used PDFX for text extraction; retrieved metadata by the TogoDoc service


Named entity recognition and RDF generation


Issue: No standard existed for combining the results of various NER tools


Result: Developed a system for combining, viewing, and editing the extracted gene names to provide RDF data


Software: Extended SIO ontology for NER and newly developed the BioInterchange tool for RDF generation


Natural language query conversion to SPARQL


Issue: Automatic conversion of natural language queries to SPARQL queries is necessary to develop a human-friendly interface


Result: Incorporated the SNOMED-CT dataset to answer biomedical questions and improved linguistic analysis


Software: Improved the in-house LODQA system; used ontologies from BioPortal



IRI mapping and normalization


Issue: IRIs for entities automatically generated by BioPortal do not always match with submitted RDF-based ontologies


Result: Normalized IRIs in the BioPortal SPARQL endpoint as either the provider IRI, the IRI, or the Bio2RDF IRI


Software: Used services of BioPortal, the MIRIAM registry, and Bio2RDF


Environmental ontologies for metagenomics


Issue: Semantically controlled description of a sample’s original environment is needed in the domain of metagenomics


Result: Developed the Metagenome Environment Ontology (MEO) for the MicrobeDB project


Software: References the Environment Ontology (EnvO) and other ontologies


Lexical resources


Issue: Standard machine-readable English-Japanese / Japanese-English dictionaries are required for multilingual utilization of RDF data


Result: Developed an ontology for LSD to serialize the lexical resource in RDF and published it at a SPARQL endpoint


Software: Data provided by the Life Science Dictionary (LSD) project


Enzyme reaction equations


Issue: A new ontology must be developed to represent incomplete enzyme reactions, which are not supported by IUBMB


Result: Designed semantic representation of incomplete reactions with terms to describe chemical transformation patterns


Software: Obtained data from the KEGG database; the result is available at GenomeNet



Service quality indicators


Issue: Quality of the published datasets (SPARQL endpoints) is not clearly measured


Result: Measured the availability, response time, content amount and other quality metrics of SPARQL endpoints


Software: A website is under development to summarize the periodic measurements


Database content descriptors


Issue: The core attributes of biological databases should be described uniformly and semantically


Result: Developed the RDF Schema for the BioDBCore and improved the BioDBCore Web interface for submission and retrieval


Software: Evaluated identifiers for DBs in NAR, DBpedia, and ORCID and vocabularies from Biositemaps, EDAM, BRO and OBI


Generic metadata for dataset description


Issue: Database catalogue metadata needs to be machine-readable to enable automatic discovery


Result: Conventions to describe the nature and availability of datasets will be formalized as a community agreement


Software: Members from the W3C HCLS, DBCLS, MEDALS, BioDBCore, Biological Linked Open Data, Biositemaps, UniProt, Bio2RDF, Biogateway, Open PHACTS, and EURECA continue the discussion in teleconferences



RDFization tools


Issue: RDF generation tools supporting various data formats and data sources are not yet sufficient


Result: Tools to generate RDF from CSV, TSV, XML, GFF3, GVF and other formats including text mining results were developed


Software: BioInterchange can be used as a tool, a Web service, or a library; bio-table is a generic tool for tabular data


Triple stores


Issue: A survey is needed to test the scalability of distributed/cluster-based triple stores for multi-resource integration


Result: Hadoop-based and cluster-based triple stores were still immature, and federated queries on OWLIM-SE were still inefficient


Software: HadoopRDF, SHARD and WebPIE for Hadoop-based triple stores; 4store and bigdata for cluster-based triple stores



Semantic Web exploration and visualization


Issue: Interactive exploration and visualization tools for Semantic Web resources are required to make effective queries


Result: Tools are reviewed from viewpoints of requirements and availability, features, assistance and support, technical aspects, and specificity to life sciences use cases


Software: More than 30 tools currently available are reviewed and classified for benchmarking and evaluations in the future


Ontology mapping visualization


Issue: Visualization of ontology mapping is required to understand how different ontologies with related concepts are interconnected


Result: Ontology mappings of all BioPortal ontologies and a subset of BioPortal ontologies suitable for OntoFinder/Factory were visualized


Software: Applicability of Google Fusion Tables and Gephi was investigated


Identifier conversion service


Issue: Multiple synonyms for the same data inhibit cross-resource querying and data mining


Result: Developed a new service to extract cross references from UniProt and KEGG databases, eliminate redundancy and visualize the result


Software: G-Links resolves and retrieves all corresponding resource URIs


Semantic query via voice recognition


Issue: Intuitive search interface similar to “Siri for biologists” would be useful


Result: Developed a context-aware virtual research assistant Genie which recognizes spoken English and replies in a synthesized voice


Software: The G-language GAE, G-language Maps, KBWS EMBASSY and EMBOSS, and G-Links are used for Genie
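
To make the "RDFization tools" row above more concrete, the following is a minimal sketch of how tabular data (CSV or TSV) might be turned into RDF triples in N-Triples syntax. It is not the implementation used by BioInterchange or bio-table; the namespace `http://example.org/`, the `record/` and `vocab/` path segments, and the function names `escape_literal` and `table_to_ntriples` are assumptions made only for this illustration.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace chosen for this sketch; it does not come from the source.
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def table_to_ntriples(text: str, id_column: str, delimiter: str = ",") -> list[str]:
    """Map each row to one subject IRI and each remaining column to a predicate."""
    triples = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        subject = f"<{BASE}record/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            # Skip overflow cells (column is None), the key column, and empty cells.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}vocab/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Example input: a two-row TSV table keyed by its "id" column.
sample = "id\tsymbol\ng1\tBRCA2\ng2\tTP53\n"
for line in table_to_ntriples(sample, id_column="id", delimiter="\t"):
    print(line)
```

Every cell becomes a plain string literal here. A real converter would typically map columns to terms from an existing ontology and emit typed literals or IRIs where appropriate.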