BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains

Katayama, Toshiaki; Wilkinson, Mark D; Aoki-Kinoshita, Kiyoko F; Kawashima, Shuichi; Yamamoto, Yasunori; Yamaguchi, Atsuko; Okamoto, Shinobu; Kawano, Shin; Kim, Jin-Dong; Wang, Yue; Wu, Hongyan; Kano, Yoshinobu; Ono, Hiromasa; Bono, Hidemasa; Kocbek, Simon; Aerts, Jan; Akune, Yukie; Antezana, Erick; Arakawa, Kazuharu; Aranda, Bruno; Baran, Joachim; Bolleman, Jerven; Bonnal, Raoul JP; Buttigieg, Pier Luigi; Campbell, Matthew P; Chen, Yi-an; Chiba, Hirokazu; Cock, Peter JA; Cohen, K Bretonnel; Constantin, Alexandru; Duck, Geraint; Dumontier, Michel; Fujisawa, Takatomo; Fujiwara, Toyofumi; Goto, Naohisa; Hoehndorf, Robert; Igarashi, Yoshinobu; Itaya, Hidetoshi; Ito, Maori; Iwasaki, Wataru; Kalaš, Matúš; Katoda, Takeo; Kim, Taehong; Kokubu, Anna; Komiyama, Yusuke; Kotera, Masaaki; Laibe, Camille; Lapp, Hilmar; Lütteke, Thomas; Marshall, M Scott; Mori, Takaaki; Mori, Hiroshi; Morita, Mizuki; Murakami, Katsuhiko; Nakao, Mitsuteru; Narimatsu, Hisashi; Nishide, Hiroyo; Nishimura, Yosuke; Nystrom-Persson, Johan; Ogishima, Soichi; Okamura, Yasunobu; Okuda, Shujiro; Oshita, Kazuki; Packer, Nicki H; Prins, Pjotr; Ranzinger, Rene; Rocca-Serra, Philippe; Sansone, Susanna; Sawaki, Hiromichi; Shin, Sung-Ho; Splendiani, Andrea; Strozzi, Francesco; Tadaka, Shu; Toukach, Philip; Uchiyama, Ikuo; Umezaki, Masahito; Vos, Rutger; Whetzel, Patricia L; Yamada, Issaku; Yamasaki, Chisato; Yamashita, Riu; York, William S; Zmasek, Christian M; Kawamoto, Shoko; Takagi, Toshihisa

doi:10.1186/2041-1480-5-5

Journal of Biomedical Semantics

Table 1 Summary of investigated issues and results covered during BioHackathons 2011 and 2012


RDF data
	*Domain specific models*
	Genome and proteome data
	*Issue*: No standard RDF data model and tools existed for major genomic data
	*Result*: Created FALDO, INSDC, GFF, GVF ontologies and developed converters
	*Software*: Converters are now packaged in the BioInterchange tool; improved PSICQUIC service
	Glycome data
	*Issue*: Glycome and proteome databases are not effectively linked
	*Result*: Developers from BCSDB, GlycomeDB, GLYCOSCIENCES.de, JCGGDB, MonosaccharideDB, RINGS, UniCarbKB and UniProt developed a standard RDF representation for carbohydrate structures
	*Software*: RDFized data from these databases, stored them in Virtuoso and tested SPARQL queries among the different data resources
	*Text processing*
	Text extraction from PDF and metadata retrieval
	*Issue*: Text for mining is often buried in the PDF formatted literature and requires preprocessing
	*Result*: Incorporated a tool for text extraction combined with a metadata retrieval service for DOIs or PMIDs
	*Software*: Used PDFX for text extraction; retrieved metadata by the TogoDoc service
	Named entity recognition and RDF generation
	*Issue*: No standard existed for combining the results of various NER tools
	*Result*: Developed a system for combining, viewing, and editing the extracted gene names to provide RDF data
	*Software*: Extended SIO ontology for NER and newly developed the BioInterchange tool for RDF generation
	Natural language query conversion to SPARQL
	*Issue*: Automatic conversion of natural language queries to SPARQL queries is necessary to develop a human friendly interface
	*Result*: Incorporated the SNOMED-CT dataset to answer biomedical questions and improved linguistic analysis
	*Software*: Improved the in-house LODQA system; used ontologies from BioPortal
Ontology
	*IRI mapping and normalization*
	*Issue*: IRIs for entities automatically generated by BioPortal do not always match with submitted RDF-based ontologies
	*Result*: Normalized IRIs in the BioPortal SPARQL endpoint as either the provider IRI, the Identifiers.org IRI, or the Bio2RDF IRI
	*Software*: Used services of BioPortal, the MIRIAM registry, Identifiers.org and Bio2RDF
	*Environmental ontologies for metagenomics*
	*Issue*: Semantically controlled description of a sample’s original environment is needed in the domain of metagenomics
	*Result*: Developed the Metagenome Environment Ontology (MEO) for the MicrobeDB project
	*Software*: References the Environment Ontology (EnvO) and other ontologies
	*Lexical resources*
	*Issue*: Standard machine-readable English-Japanese / Japanese-English dictionaries are required for multilingual utilization of RDF data
	*Result*: Developed ontology for LSD to serialize the lexical resource in RDF and published it at a SPARQL endpoint
	*Software*: Data provided by the Life Science Dictionary (LSD) project
	*Enzyme reaction equations*
	*Issue*: New ontology must be developed to represent incomplete enzyme reactions which are not supported by IUBMB
	*Result*: Designed semantic representation of incomplete reactions with terms to describe chemical transformation patterns
	*Software*: Obtained data from the KEGG database and the result is available at GenomeNet
Metadata
	*Service quality indicators*
	*Issue*: Quality of the published datasets (SPARQL endpoints) is not clearly measured
	*Result*: Measured the availability, response time, content amount and other quality metrics of SPARQL endpoints
	*Software*: Web site is under development to illustrate the summary of periodical measurements
	*Database content descriptors*
	*Issue*: The core attributes of biological databases should be described uniformly and semantically
	*Result*: Developed the RDF Schema for the BioDBCore and improved the BioDBCore Web interface for submission and retrieval
	*Software*: Evaluated identifiers for DBs in NAR, DBpedia, Identifiers.org and ORCID and vocabularies from Biositemaps, EDAM, BRO and OBI
	*Generic metadata for dataset description*
	*Issue*: Database catalogue metadata needs to be machine-readable for enabling automatic discovery
	*Result*: Conventions to describe the nature and availability of datasets will be formalized as a community agreement
	*Software*: Members from the W3C HCLS, DBCLS, MEDALS, BioDBCore, Biological Linked Open Data, Biositemaps, UniProt, Bio2RDF, Biogateway, Open PHACTS, EURECA, and Identifiers.org continue the discussion in teleconferences
Platforms
	*RDFization tools*
	*Issue*: RDF generation tools supporting various data formats and data sources are not yet sufficient
	*Result*: Tools to generate RDF from CSV, TSV, XML, GFF3, GVF and other formats including text mining results were developed
	*Software*: BioInterchange can be used as a tool, a Web service or a library; bio-table is a generic tool for tabular data
	*Triple stores*
	*Issue*: Survey is needed to test scalability of distributed/cluster-based triple stores for multi-resource integration
	*Result*: Hadoop-based and cluster-based triple stores were still immature, and federated queries on OWLIM-SE were still inefficient
	*Software*: HadoopRDF, SHARD and WebPIE for Hadoop-based triple stores; 4store and bigdata for cluster-based triple stores
Applications
	*Semantic Web exploration and visualization*
	*Issue*: Interactive exploration and visualization tools for Semantic Web resources are required to make effective queries
	*Result*: Tools are reviewed from viewpoints of requirements and availability, features, assistance and support, technical aspects, and specificity to life sciences use cases
	*Software*: More than 30 tools currently available are reviewed and classified for benchmarking and evaluations in the future
	*Ontology mapping visualization*
	*Issue*: Visualization of ontology mapping is required to understand how different ontologies with relating concepts are interconnected
	*Result*: Ontology mappings of all BioPortal ontologies and a subset of BioPortal ontologies suitable for OntoFinder/Factory were visualized
	*Software*: Applicability of Google Fusion Tables and Gephi was investigated
	*Identifier conversion service*
	*Issue*: Multiple synonyms for the same data inhibit cross-resource querying and data mining
	*Result*: Developed a new service to extract cross references from UniProt and KEGG databases, eliminate redundancy and visualize the result
	*Software*: G-Links resolves and retrieves all corresponding resource URIs
	*Semantic query via voice recognition*
	*Issue*: Intuitive search interface similar to “Siri for biologists” would be useful
	*Result*: Developed a context-aware virtual research assistant Genie which recognizes spoken English and replies in a synthesized voice
	*Software*: The G-language GAE, G-language Maps, KBWS EMBASSY and EMBOSS, and G-Links are used for Genie
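
The IRI normalization row above (provider IRI, Identifiers.org IRI, or Bio2RDF IRI) can be illustrated with a minimal sketch. This is not the BioPortal implementation. The function name normalize_iri, the two IRI templates, and the provider template in the usage note are assumptions made for illustration and should be checked against each service's current documentation.

```python
from urllib.parse import quote

# Assumed IRI patterns (not taken from the table); verify before relying on them.
IDENTIFIERS_ORG_TEMPLATE = "http://identifiers.org/{prefix}/{accession}"
BIO2RDF_TEMPLATE = "http://bio2rdf.org/{prefix}:{accession}"


def normalize_iri(prefix, accession, scheme="identifiers.org", provider_template=None):
    """Build one normalized IRI for an entity given its namespace and accession.

    scheme is one of "provider", "identifiers.org" or "bio2rdf", mirroring the
    three normalization targets named in the table. The "provider" scheme
    needs a template containing "{accession}", supplied by the caller.
    """
    prefix = prefix.strip().lower()
    # Percent-encode unusual characters but keep common accession punctuation.
    accession = quote(accession.strip(), safe=":._-")

    if scheme == "identifiers.org":
        return IDENTIFIERS_ORG_TEMPLATE.format(prefix=prefix, accession=accession)
    if scheme == "bio2rdf":
        return BIO2RDF_TEMPLATE.format(prefix=prefix, accession=accession)
    if scheme == "provider":
        if provider_template is None:
            raise ValueError("provider scheme requires provider_template")
        return provider_template.format(accession=accession)
    raise ValueError("unknown scheme: " + scheme)
```

For example, normalize_iri("UniProt", "P12345", scheme="bio2rdf") builds a Bio2RDF-style IRI, while passing scheme="provider" with a hypothetical template such as "http://purl.uniprot.org/uniprot/{accession}" keeps the provider's own form.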


ISSN: 2041-1480
