RDF data | |
Domain specific models | |
Genome and proteome data | |
Issue: No standard RDF data model and tools existed for major genomic data | |
Result: Created FALDO, INSDC, GFF, GVF ontologies and developed converters | |
Software: Converters are now packaged in the BioInterchange tool; improved PSICQUIC service | |
Glycome data | |
Issue: Glycome and proteome databases are not effectively linked | |
Result: Developers of BCSDB, GlycomeDB, GLYCOSCIENCES.de, JCGGDB, MonosaccharideDB, RINGS, UniCarbKB and UniProt developed a standard RDF representation for carbohydrate structures | |
Software: RDFized data from these databases, stored them in Virtuoso and tested SPARQL queries among the different data resources | |
Text processing | |
Text extraction from PDF and metadata retrieval | |
Issue: Text for mining is often buried in the PDF formatted literature and requires preprocessing | |
Result: Incorporated a tool for text extraction combined with a metadata retrieval service for DOIs or PMIDs | |
Software: Used PDFX for text extraction; retrieved metadata through the TogoDoc service | |
Named entity recognition and RDF generation | |
Issue: No standard existed for combining the results of various NER tools | |
Result: Developed a system for combining, viewing, and editing the extracted gene names to provide RDF data | |
Software: Extended SIO ontology for NER and newly developed the BioInterchange tool for RDF generation | |
Natural language query conversion to SPARQL | |
Issue: Automatic conversion of natural language queries to SPARQL queries is necessary to develop a human friendly interface | |
Result: Incorporated the SNOMED-CT dataset to answer biomedical questions and improved linguistic analysis | |
Software: Improved the in-house LODQA system; used ontologies from BioPortal | |
Ontology | |
IRI mapping and normalization | |
Issue: IRIs that BioPortal generates automatically for entities do not always match those in the submitted RDF-based ontologies | |
Result: Normalized IRIs in the BioPortal SPARQL endpoint as either the provider IRI, the Identifiers.org IRI, or the Bio2RDF IRI | |
Software: Used services of BioPortal, the MIRIAM registry, Identifiers.org and Bio2RDF | |
Environmental ontologies for metagenomics | |
Issue: Semantically controlled description of a sample’s original environment is needed in the domain of metagenomics | |
Result: Developed the Metagenome Environment Ontology (MEO) for the MicrobeDB project | |
Software: References the Environment Ontology (EnvO) and other ontologies | |
Lexical resources | |
Issue: Standard machine-readable English-Japanese / Japanese-English dictionaries are required for multilingual utilization of RDF data | |
Result: Developed an ontology for LSD to serialize the lexical resource in RDF and published it at a SPARQL endpoint | |
Software: Data provided by the Life Science Dictionary (LSD) project | |
Enzyme reaction equations | |
Issue: New ontology must be developed to represent incomplete enzyme reactions which are not supported by IUBMB | |
Result: Designed semantic representation of incomplete reactions with terms to describe chemical transformation patterns | |
Software: Obtained data from the KEGG database and the result is available at GenomeNet | |
Metadata | |
Service quality indicators | |
Issue: Quality of the published datasets (SPARQL endpoints) is not clearly measured | |
Result: Measured the availability, response time, content amount and other quality metrics of SPARQL endpoints | |
Software: A Web site summarizing the periodic measurements is under development | |
Database content descriptors | |
Issue: Core attributes of biological databases should be described uniformly and semantically | |
Result: Developed the RDF Schema for the BioDBCore and improved the BioDBCore Web interface for submission and retrieval | |
Software: Evaluated identifiers for DBs in NAR, DBpedia, Identifiers.org and ORCID and vocabularies from Biositemaps, EDAM, BRO and OBI | |
Generic metadata for dataset description | |
Issue: Database catalogue metadata needs to be machine-readable for enabling automatic discovery | |
Result: Conventions to describe the nature and availability of datasets will be formalized as a community agreement | |
Software: Members from the W3C HCLS, DBCLS, MEDALS, BioDBCore, Biological Linked Open Data, Biositemaps, UniProt, Bio2RDF, Biogateway, Open PHACTS, EURECA, and Identifiers.org continue the discussion in teleconferences | |
Platforms | |
RDFization tools | |
Issue: RDF generation tools supporting various data formats and data sources are not yet sufficient | |
Result: Tools to generate RDF from CSV, TSV, XML, GFF3, GVF and other formats including text mining results were developed | |
Software: BioInterchange can be used as a tool, a Web service, or a library; bio-table is a generic tool for tabular data | |
Triple stores | |
Issue: Survey is needed to test scalability of distributed/cluster-based triple stores for multi-resource integration | |
Result: Hadoop-based and Cluster-based triple stores were still immature, and federated queries on OWLIM-SE were still inefficient | |
Software: HadoopRDF, SHARD and WebPIE for Hadoop-based triple stores; 4store and bigdata for Cluster-based triple stores | |
Applications | |
Semantic Web exploration and visualization | |
Issue: Interactive exploration and visualization tools for Semantic Web resources are required to make effective queries | |
Result: Tools are reviewed from viewpoints of requirements and availability, features, assistance and support, technical aspects, and specificity to life sciences use cases | |
Software: More than 30 tools currently available are reviewed and classified for benchmarking and evaluations in the future | |
Ontology mapping visualization | |
Issue: Visualization of ontology mappings is required to understand how different ontologies with related concepts are interconnected | |
Result: Ontology mappings of all BioPortal ontologies and a subset of BioPortal ontologies suitable for OntoFinder/Factory were visualized | |
Software: Applicability of Google Fusion Tables and Gephi was investigated | |
Identifier conversion service | |
Issue: Multiple synonyms for the same data inhibit cross-resource querying and data mining | |
Result: Developed a new service to extract cross references from UniProt and KEGG databases, eliminate redundancy and visualize the result | |
Software: G-Links resolves and retrieves all corresponding resource URIs | |
Semantic query via voice recognition | |
Issue: Intuitive search interface similar to “Siri for biologists” would be useful | |
Result: Developed a context-aware virtual research assistant Genie which recognizes spoken English and replies in a synthesized voice | |
Software: The G-language GAE, G-language Maps, KBWS EMBASSY and EMBOSS, and G-Links are used for Genie |
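
As an illustration of the "Genome and proteome data" and "RDFization tools" rows, the following minimal sketch turns one GFF3 feature line into N-Triples that describe its location with FALDO terms. It is not the BioInterchange implementation and does not reproduce its output. The function name `gff3_line_to_ntriples`, the `http://example.org/` base IRI, and the IRI layout for features, regions and positions are assumptions made for this example. The sketch also assumes the FALDO convention that `begin` and `end` follow the strand, so a reverse-strand feature begins at the higher coordinate.

```python
from urllib.parse import quote

FALDO = "http://biohackathon.org/resource/faldo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
BASE = "http://example.org/"  # hypothetical base IRI, for illustration only


def gff3_line_to_ntriples(line, base=BASE):
    """Return N-Triples lines describing the location of one GFF3 feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
    start, end = int(start), int(end)

    # Column 9 holds semicolon-separated key=value attributes.
    attributes = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
    feature_id = attributes.get("ID", f"{seqid}_{start}_{end}")

    # Hypothetical IRI layout; a real converter would use stable provider IRIs.
    feature = f"{base}feature/{quote(feature_id, safe='')}"
    sequence = f"{base}sequence/{quote(seqid, safe='')}"
    region = f"{feature}/region"

    strand_class = {"+": "ForwardStrandPosition",
                    "-": "ReverseStrandPosition"}.get(strand)
    # Assumption: begin/end follow the strand, so swap them on the reverse strand.
    begin_pos, end_pos = (end, start) if strand == "-" else (start, end)

    def triple(s, p, o):
        return f"<{s}> <{p}> {o} ."

    lines = [
        triple(feature, FALDO + "location", f"<{region}>"),
        triple(region, RDF_TYPE, f"<{FALDO}Region>"),
    ]
    for role, pos in (("begin", begin_pos), ("end", end_pos)):
        node = f"{region}/{role}"
        lines.append(triple(region, FALDO + role, f"<{node}>"))
        lines.append(triple(node, RDF_TYPE, f"<{FALDO}ExactPosition>"))
        if strand_class:  # unstranded features get no strand type
            lines.append(triple(node, RDF_TYPE, f"<{FALDO}{strand_class}>"))
        lines.append(triple(node, FALDO + "position", f'"{pos}"^^<{XSD_INTEGER}>'))
        lines.append(triple(node, FALDO + "reference", f"<{sequence}>"))
    return lines


if __name__ == "__main__":
    example = "chr1\texample\tgene\t1000\t2000\t.\t-\t.\tID=gene0001;Name=demo"
    print("\n".join(gff3_line_to_ntriples(example)))
```

The resulting lines can be loaded into any triple store that accepts N-Triples, so the feature coordinates become queryable through SPARQL alongside the other resources listed in the table.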