Table 1 Summary of investigated issues and results covered during BioHackathons 2011 and 2012

From: BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains

RDF data  
  Domain specific models
  Genome and proteome data
  Issue: No standard RDF data model and tools existed for major genomic data
  Result: Created FALDO, INSDC, GFF, GVF ontologies and developed converters
  Software: Converters are now packaged in the BioInterchange tool; improved PSICQUIC service
  Glycome data
  Issue: Glycome and proteome databases are not effectively linked
  Result: Developed a standard RDF representation for carbohydrate structures by BCSDB, GlycomeDB, JCGGDB, MonosaccharideDB, RINGS, UniCarbKB and UniProt developers
  Software: RDFized data from these databases, stored it in Virtuoso and tested SPARQL queries across the different data resources
  Text processing
  Text extraction from PDF and metadata retrieval
  Issue: Text for mining is often buried in the PDF formatted literature and requires preprocessing
  Result: Incorporated a tool for text extraction combined with a metadata retrieval service for DOIs or PMIDs
  Software: Used PDFX for text extraction; retrieved metadata via the TogoDoc service
  Named entity recognition and RDF generation
  Issue: No standard existed for combining the results of various NER tools
  Result: Developed a system for combining, viewing, and editing the extracted gene names to provide RDF data
  Software: Extended SIO ontology for NER and newly developed the BioInterchange tool for RDF generation
  Natural language query conversion to SPARQL
  Issue: Automatic conversion of natural language queries to SPARQL queries is necessary to develop a human friendly interface
  Result: Incorporated the SNOMED-CT dataset to answer biomedical questions and improved linguistic analysis
  Software: Improved the in-house LODQA system; used ontologies from BioPortal
  IRI mapping and normalization
  Issue: IRIs for entities automatically generated by BioPortal do not always match those in the submitted RDF-based ontologies
  Result: Normalized IRIs in the BioPortal SPARQL endpoint as either the provider IRI, the Identifiers.org IRI, or the Bio2RDF IRI
  Software: Used services of BioPortal, the MIRIAM registry, and Bio2RDF
  Environmental ontologies for metagenomics
  Issue: Semantically controlled description of a sample’s original environment is needed in the domain of metagenomics
  Result: Developed the Metagenome Environment Ontology (MEO) for the MicrobeDB project
  Software: References the Environment Ontology (EnvO) and other ontologies
  Lexical resources
  Issue: Standard machine-readable English-Japanese / Japanese-English dictionaries are required for multilingual utilization of RDF data
  Result: Developed an ontology for LSD to serialize the lexical resource in RDF and published it at a SPARQL endpoint
  Software: Data provided by the Life Science Dictionary (LSD) project
  Enzyme reaction equations
  Issue: New ontology must be developed to represent incomplete enzyme reactions which are not supported by IUBMB
  Result: Designed semantic representation of incomplete reactions with terms to describe chemical transformation patterns
  Software: Obtained data from the KEGG database; the result is available at GenomeNet
  Service quality indicators
  Issue: Quality of the published datasets (SPARQL endpoints) is not clearly measured
  Result: Measured the availability, response time, content amount and other quality metrics of SPARQL endpoints
  Software: A Web site summarizing the periodic measurements is under development
  Database content descriptors
  Issue: Core attributes of biological databases should be described in a uniform, semantic manner
  Result: Developed the RDF Schema for the BioDBCore and improved the BioDBCore Web interface for submission and retrieval
  Software: Evaluated identifiers for DBs in NAR, DBpedia, and ORCID and vocabularies from Biositemaps, EDAM, BRO and OBI
  Generic metadata for dataset description
  Issue: Database catalogue metadata needs to be machine-readable for enabling automatic discovery
  Result: Conventions to describe the nature and availability of datasets will be formalized as a community agreement
  Software: Members from the W3C HCLS, DBCLS, MEDALS, BioDBCore, Biological Linked Open Data, Biositemaps, UniProt, Bio2RDF, Biogateway, Open PHACTS, EURECA, and Identifiers.org continue the discussion in teleconferences
  RDFization tools
  Issue: RDF generation tools supporting various data formats and data sources are not yet sufficient
  Result: Tools to generate RDF from CSV, TSV, XML, GFF3, GVF and other formats including text mining results were developed
  Software: BioInterchange can be used as a tool, a Web service or a library; bio-table is a generic tool for tabular data
  Triple stores
  Issue: Survey is needed to test scalability of distributed/cluster-based triple stores for multi-resource integration
  Result: Hadoop-based and cluster-based triple stores were still immature and federated queries on OWLIM-SE were still inefficient
  Software: HadoopRDF, SHARD and WebPIE for Hadoop-based triple stores; 4store and bigdata for cluster-based triple stores
  Semantic Web exploration and visualization
  Issue: Interactive exploration and visualization tools for Semantic Web resources are required to make effective queries
  Result: Tools are reviewed from viewpoints of requirements and availability, features, assistance and support, technical aspects, and specificity to life sciences use cases
  Software: More than 30 tools currently available are reviewed and classified for benchmarking and evaluations in the future
  Ontology mapping visualization
  Issue: Visualization of ontology mapping is required to understand how different ontologies with related concepts are interconnected
  Result: Ontology mappings of all BioPortal ontologies and a subset of BioPortal ontologies suitable for OntoFinder/Factory were visualized
  Software: Applicability of Google Fusion Tables and Gephi was investigated
  Identifier conversion service
  Issue: Multiple synonyms for the same data inhibit cross-resource querying and data mining
  Result: Developed a new service to extract cross references from UniProt and KEGG databases, eliminate redundancy and visualize the result
  Software: G-Links resolves and retrieves all corresponding resource URIs
  Semantic query via voice recognition
  Issue: Intuitive search interface similar to “Siri for biologists” would be useful
  Result: Developed a context-aware virtual research assistant, Genie, which recognizes spoken English and replies in a synthesized voice
  Software: The G-language GAE, G-language Maps, KBWS EMBASSY and EMBOSS, and G-Links are used for Genie
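
The "RDFization tools" row above mentions generating RDF from tabular formats such as CSV and TSV. As a rough illustration of what such a conversion involves, the sketch below turns TSV rows into N-Triples using only the Python standard library. It is not BioInterchange or bio-table code: the function names, the `http://example.org/` namespace, and the sample data are all hypothetical, and every value is emitted as a plain string literal.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace chosen for illustration; not taken from the table.
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def tsv_to_ntriples(text: str, id_column: str) -> list[str]:
    """Emit one triple per non-empty, non-ID cell of a TSV table.

    The ID column becomes the subject IRI and each other column name
    becomes a predicate IRI under the hypothetical BASE namespace.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        subject = f"<{BASE}record/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            # Skip the ID itself, surplus cells (column is None) and empty cells.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}vocab/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample rows; the second row has an empty "organism" cell.
sample = "id\tsymbol\torganism\ng1\tBRCA1\tHomo sapiens\ng2\tTP53\t\n"

for line in tsv_to_ntriples(sample, "id"):
    print(line)
```

A real converter would map columns to terms from established ontologies and type the values appropriately rather than minting predicates from column names, which is the part the tools listed in the table are meant to handle.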