Gene Wiki+ seamlessly integrates a multitude of knowledge sources. Both source wikis leverage and filter numerous large public datasets, such as dbSNP, HapMap, OMIM, PDB, PharmGKB and PubMed. Software bots from a distributed developer community continuously update both SNPedia and the Gene Wiki with structured content. SNPedia bots mine the scientific literature to populate new textual wiki content, whereas all textual content on the Gene Wiki is manually entered. All content in both wikis is curated and expanded by a human editor community. All of this information flows into the Gene Wiki+ through a continuous one-way syncing process providing users with up-to-date access to a diverse body of gene and disease related knowledge.
Overall, the Gene Wiki+ meta-wiki captures 8,047 distinct gene-disease relationships where genes are indicated by NCBI Gene identifiers and diseases are represented with Disease Ontology terms. There are 3,238 distinct genes and 1,060 distinct diseases represented in these gene-disease pairs. SNPedia accounts for 4,149 of the gene-disease pairs via gene-SNP-disease connections, while the Gene Wiki provides 4,377 via direct gene-disease associations. Only 479 (6%) of the gene-disease pairs appear independently in both sources (Figure 2).
The 479 gene-disease pairs in the overlap contained 227 distinct diseases and 376 distinct genes. For example, the gene CYSLTR1 is linked to asthma in the text of the Gene Wiki: “The cysteinyl leukotrienes [...] are important mediators of human bronchial asthma” and in the text of the SNPedia article on rs320995 (a SNP found in the CYSLTR1 gene): “subjects without T-allele in SNP rs320995 had 3.1 times higher risk of asthma” [21].
As Figure 2 clearly illustrates, both the Gene Wiki and SNPedia contain substantial amounts of knowledge pertinent to the challenge of finding associations between genes and diseases. The low level of overlap between the gene-disease associations found in these resources indicates the potential value of their combination.
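The overlap figures reported above follow from ordinary set arithmetic. The sketch below is illustrative only: the toy identifiers (`G1`, `D1`, and so on) and the helper `overlap_summary` are hypothetical, and only the three counts in the final lines are taken from the text.

```python
# Toy gene-disease pairs (hypothetical identifiers, not real Gene Wiki+ data).
gene_wiki_pairs = {("G1", "D1"), ("G2", "D1"), ("G3", "D2")}
snpedia_pairs = {("G1", "D1"), ("G4", "D3")}


def overlap_summary(a, b):
    """Return the sizes of the union and intersection of two sets of pairs."""
    both = a & b        # pairs found independently in both sources
    combined = a | b    # distinct pairs across the two sources
    return {
        "union": len(combined),
        "overlap": len(both),
        "overlap_fraction": len(both) / len(combined),
    }


summary = overlap_summary(gene_wiki_pairs, snpedia_pairs)

# Inclusion-exclusion applied to the counts reported in the text:
# Gene Wiki pairs + SNPedia pairs - shared pairs = distinct pairs.
reported_union = 4377 + 4149 - 479
reported_fraction = 479 / reported_union
```

With the reported counts, `reported_union` comes to 8,047 and `reported_fraction` rounds to 6%, matching the totals given for the meta-wiki.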
Gene Wiki+ RDF
One of the key advantages of the Semantic MediaWiki framework is its ability to export the knowledge it contains in a structured form that adheres to the Resource Description Framework (RDF) standard. This makes it possible to analyze the data with the growing collection of tools built on this standard. Loading the RDF exported from Gene Wiki+ into a triplestore such as Jena, AllegroGraph, or OpenLink Virtuoso makes it possible to execute SPARQL queries (SPARQL is the standard query language for RDF), to integrate the data with other RDF resources, and to take advantage of Semantic Web reasoning systems.
As one example, the 479 gene-disease pairs in the overlap between the Gene Wiki and SNPedia can be identified with the following SPARQL query executed in a triplestore.
PREFIX wiki: <http://genewikiplus.org/wiki/Special:URIResolver/>
PREFIX property: <http://genewikiplus.org/wiki/Special:URIResolver/Property-3A>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
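The kind of join that such an overlap query performs can be sketched in plain Python over a handful of triples. This is a toy stand-in under stated assumptions, not the SPARQL query itself: the predicate names (`mentions_disease`, `located_in_gene`, `associated_with`) and the `GENE_X` row are hypothetical placeholders rather than actual Gene Wiki+ properties, and only the CYSLTR1, rs320995 and asthma example comes from the text.

```python
# Toy (subject, predicate, object) triples. All predicate names are
# hypothetical placeholders, not real Gene Wiki+ property names.
triples = {
    ("CYSLTR1", "mentions_disease", "asthma"),    # Gene Wiki-style direct link
    ("rs320995", "located_in_gene", "CYSLTR1"),   # SNPedia-style SNP-to-gene link
    ("rs320995", "associated_with", "asthma"),    # SNPedia-style SNP-to-disease link
    ("GENE_X", "mentions_disease", "disease_y"),  # direct link with no SNP support
}


def overlapping_pairs(triples):
    """Return gene-disease pairs supported both directly and via a SNP."""
    direct = {(s, o) for s, p, o in triples if p == "mentions_disease"}
    snp_gene = {(s, o) for s, p, o in triples if p == "located_in_gene"}
    snp_disease = {(s, o) for s, p, o in triples if p == "associated_with"}
    # Join the two SNP relations on the shared SNP to get gene-SNP-disease paths.
    via_snp = {
        (gene, disease)
        for snp, gene in snp_gene
        for other_snp, disease in snp_disease
        if snp == other_snp
    }
    return direct & via_snp


overlap = overlapping_pairs(triples)
```

In a triplestore the same result would come from a single graph pattern that requires both paths to match for the same gene and disease.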
Inference in Gene Wiki+
Since we imported the Human Disease Ontology class hierarchy into the Gene Wiki+ as MediaWiki categories, the subclass relationships from the ontology are accessible both within the wiki and in the exported RDF. Inside the Gene Wiki+, we can make use of Semantic MediaWiki’s category processing capability to process these hierarchical relationships. For example, we can ask for all genes related to diseases of mental health with the following inline query:
This query produced 237 results, all of which were obtained through inference (i.e., no gene was annotated directly to the term ‘disease of mental health’).
The RDF export process translates MediaWiki categories into simple OWL (Web Ontology Language) classes without logical definitions and sub-category relationships into RDF-Schema subclass relations. As a result, the export contains a mirror of the class hierarchy from the Disease Ontology. This makes it possible to utilize basic reasoning over this hierarchy when processing the exported RDF.
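Basic reasoning over such a hierarchy amounts to following subclass relations transitively. The sketch below assumes a small, hypothetical fragment of a disease hierarchy and hypothetical gene annotations (`GENE_A`, `GENE_B`); it shows how a gene annotated only to a specific term is also returned for an ancestor term.

```python
from collections import defaultdict

# Hypothetical fragment of a disease hierarchy as (child, parent) pairs.
subclass_of = [
    ("cognitive disorder", "disease of mental health"),
    ("dementia", "cognitive disorder"),
    ("asthma", "respiratory system disease"),
]

# Hypothetical direct annotations: gene -> set of disease terms.
annotations = {"GENE_A": {"dementia"}, "GENE_B": {"asthma"}}


def descendants(root, edges):
    """Return every term reachable from root by following subclass edges downward."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_for(term, edges, annotations):
    """Return genes annotated to the term or to any of its subclasses."""
    terms = descendants(term, edges) | {term}
    return {gene for gene, diseases in annotations.items() if diseases & terms}


inferred = genes_for("disease of mental health", subclass_of, annotations)
```

Here `GENE_A` is returned for ‘disease of mental health’ although it is annotated only to a descendant term, which is the same pattern as the inferred results described above.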
The Gene Wiki+ as Linked Data
In addition to manipulating the data in Gene Wiki+ with RDF processing tools, the RDF export facilitates integration with other data sources in the Linked Data cloud. All of the articles from the Gene Wiki are linked via owl:sameAs relationships to the equivalent entities in DBpedia [22]. As DBpedia forms one of the primary hubs of the Linked Data cloud [23], this connection ties the Gene Wiki+ meta-wiki directly into the rest of the Semantic Web. Aside from DBpedia integration, all of the categories brought in from the Disease Ontology are marked with owl:equivalentClass relationships to the corresponding classes in the Disease Ontology as indicated by their OBO URIs. For example, the Gene Wiki+ category for Vulvar keratoacanthoma-like carcinoma is marked as equivalent to the entity in the Disease Ontology identified by the URI http://purl.obolibrary.org/obo/DOID_7408. By supplying stable URIs, facilitating data access as RDF, and maintaining clear relationships to other Semantic Web resources, the Gene Wiki+ meta-wiki is ready for use in RDF-based applications that facilitate interaction with data drawn dynamically from multiple sources.
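As a minimal illustration of such an equivalence link, the sketch below builds the OBO URI for a Disease Ontology identifier and formats one owl:equivalentClass statement in N-Triples syntax. The helper names and the `example.org` category URI are hypothetical; the actual category URIs produced by the Gene Wiki+ export are not reproduced here.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"
OBO_BASE = "http://purl.obolibrary.org/obo/"


def doid_to_uri(doid):
    """Turn an identifier such as 'DOID:7408' into its OBO URI."""
    prefix, _, local_id = doid.partition(":")
    return f"{OBO_BASE}{prefix}_{local_id}"


def equivalent_class_triple(category_uri, doid):
    """Format one owl:equivalentClass statement as an N-Triples line."""
    return f"<{category_uri}> <{OWL_EQUIVALENT_CLASS}> <{doid_to_uri(doid)}> ."


# Hypothetical category URI, used only for illustration.
category = "http://example.org/category/Vulvar_keratoacanthoma-like_carcinoma"
line = equivalent_class_triple(category, "DOID:7408")
```

The resulting line can be loaded by any RDF parser that accepts N-Triples, which is what allows an external tool to treat the wiki category and the ontology class as the same class.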
Enhancements to the user experience
While the aggregation of data from multiple sources in a queryable, structured form is useful for computational scientists, few ‘end-user’ biologists can be expected to enter SPARQL queries or even queries in the Semantic MediaWiki syntax. For the majority of users, the value of a meta-wiki such as this lies in the direct improvements to the individual articles that they will discover while browsing. Hence we made three specific additions to the visible areas of the meta-wiki articles. First, we added a ‘known variants’ table to all of the gene articles. This table presents SNPs related to the gene described in the article and phenotypes related to those SNPs, both drawn from the data gathered from SNPedia. Next, we added a table displaying diseases found in the text of the gene article. Figure 3 shows these enhancements to the article on Dopamine receptor D3.
In addition to the enhancements to the gene articles, we added a ‘related genes and SNPs’ table to the disease articles. (The disease articles were brought in from Wikipedia as part of the Gene Wiki import. Where no disease article existed in Wikipedia, we created a stub from the Disease Ontology definition.) This table presents genes and SNPs that are linked to the disease either in the text of a Gene Wiki article or through genetic associations found in SNPedia. Figure 4 shows how the article on peripheral neuropathy has been expanded with a section detailing related genes as well as related SNPs on these genes.