
Table 1 Summary of investigated issues and results covered during BioHackathons 2011 and 2012

From: BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains

RDF data

 
 

Domain specific models

 

Genome and proteome data

 

Issue: No standard RDF data model and tools existed for major genomic data

 

Result: Created FALDO, INSDC, GFF, GVF ontologies and developed converters

 

Software: Converters are now packaged in the BioInterchange tool; improved PSICQUIC service

 

Glycome data

 

Issue: Glycome and proteome databases are not effectively linked

 

Result: Developers from BCSDB, GlycomeDB, GLYCOSCIENCES.de, JCGGDB, MonosaccharideDB, RINGS, UniCarbKB and UniProt developed a standard RDF representation for carbohydrate structures

 

Software: RDFized data from these databases, stored it in Virtuoso and tested SPARQL queries across the different data resources

 

Text processing

 

Text extraction from PDF and metadata retrieval

 

Issue: Text for mining is often buried in PDF-formatted literature and requires preprocessing

 

Result: Incorporated a tool for text extraction combined with a metadata retrieval service for DOIs or PMIDs

 

Software: Used PDFX for text extraction; retrieved metadata by the TogoDoc service

 

Named entity recognition and RDF generation

 

Issue: No standard existed for combining the results of various NER tools

 

Result: Developed a system for combining, viewing, and editing the extracted gene names to provide RDF data

 

Software: Extended the SIO ontology for NER and developed the new BioInterchange tool for RDF generation

 

Natural language query conversion to SPARQL

 

Issue: Automatic conversion of natural language queries to SPARQL queries is necessary to develop a human-friendly interface

 

Result: Incorporated the SNOMED-CT dataset to answer biomedical questions and improved linguistic analysis

 

Software: Improved the in-house LODQA system; used ontologies from BioPortal

Ontology

 
 

IRI mapping and normalization

 

Issue: IRIs for entities automatically generated by BioPortal do not always match those in the submitted RDF-based ontologies

 

Result: Normalized IRIs in the BioPortal SPARQL endpoint as either the provider IRI, the Identifiers.org IRI, or the Bio2RDF IRI

 

Software: Used services of BioPortal, the MIRIAM registry, Identifiers.org and Bio2RDF

 

Environmental ontologies for metagenomics

 

Issue: Semantically controlled description of a sample’s original environment is needed in the domain of metagenomics

 

Result: Developed the Metagenome Environment Ontology (MEO) for the MicrobeDB project

 

Software: References the Environment Ontology (EnvO) and other ontologies

 

Lexical resources

 

Issue: Standard machine-readable English-Japanese / Japanese-English dictionaries are required for multilingual utilization of RDF data

 

Result: Developed an ontology for LSD to serialize the lexical resource in RDF and published it at a SPARQL endpoint

 

Software: Data provided by the Life Science Dictionary (LSD) project

 

Enzyme reaction equations

 

Issue: A new ontology must be developed to represent incomplete enzyme reactions, which are not supported by IUBMB

 

Result: Designed semantic representation of incomplete reactions with terms to describe chemical transformation patterns

 

Software: Obtained data from the KEGG database and the result is available at GenomeNet

Metadata

 
 

Service quality indicators

 

Issue: Quality of the published datasets (SPARQL endpoints) is not clearly measured

 

Result: Measured the availability, response time, content amount and other quality metrics of SPARQL endpoints

 

Software: A Web site that summarizes the periodic measurements is under development

 

Database content descriptors

 

Issue: The core attributes of biological databases need a uniform, semantic description

 

Result: Developed the RDF Schema for the BioDBCore and improved the BioDBCore Web interface for submission and retrieval

 

Software: Evaluated identifiers for DBs in NAR, DBpedia, Identifiers.org and ORCID and vocabularies from Biositemaps, EDAM, BRO and OBI

 

Generic metadata for dataset description

 

Issue: Database catalogue metadata needs to be machine-readable for enabling automatic discovery

 

Result: Conventions to describe the nature and availability of datasets will be formalized as a community agreement

 

Software: Members from the W3C HCLS, DBCLS, MEDALS, BioDBCore, Biological Linked Open Data, Biositemaps, UniProt, Bio2RDF, Biogateway, Open PHACTS, EURECA, and Identifiers.org continue the discussion in teleconferences

Platforms

 
 

RDFization tools

 

Issue: RDF generation tools supporting various data formats and data sources are not yet sufficient

 

Result: Tools to generate RDF from CSV, TSV, XML, GFF3, GVF and other formats including text mining results were developed

 

Software: BioInterchange can be used as a tool, as Web services or as libraries; bio-table is a generic tool for tabular data

 

Triple stores

 

Issue: A survey is needed to test the scalability of distributed/cluster-based triple stores for multi-resource integration

 

Result: Hadoop-based and cluster-based triple stores were still immature and federated queries on OWLIM-SE were still inefficient

 

Software: HadoopRDF, SHARD and WebPIE for Hadoop-based triple stores; 4store and bigdata for cluster-based triple stores

Applications

 
 

Semantic Web exploration and visualization

 

Issue: Interactive exploration and visualization tools for Semantic Web resources are required to make effective queries

 

Result: Tools are reviewed from viewpoints of requirements and availability, features, assistance and support, technical aspects, and specificity to life sciences use cases

 

Software: More than 30 tools currently available are reviewed and classified for benchmarking and evaluations in the future

 

Ontology mapping visualization

 

Issue: Visualization of ontology mapping is required to understand how different ontologies with related concepts are interconnected

 

Result: Ontology mappings of all BioPortal ontologies and a subset of BioPortal ontologies suitable for OntoFinder/Factory were visualized

 

Software: Applicability of Google Fusion Tables and Gephi was investigated

 

Identifier conversion service

 

Issue: Multiple synonyms for the same data inhibit cross-resource querying and data mining

 

Result: Developed a new service to extract cross references from UniProt and KEGG databases, eliminate redundancy and visualize the result

 

Software: G-Links resolves and retrieves all corresponding resource URIs

 

Semantic query via voice recognition

 

Issue: An intuitive search interface similar to “Siri for biologists” would be useful

 

Result: Developed Genie, a context-aware virtual research assistant that recognizes spoken English and replies in a synthesized voice

 

Software: The G-language GAE, G-language Maps, KBWS EMBASSY and EMBOSS, and G-Links are used for Genie