- Open Access
Linking genes to diseases with a SNPedia-Gene Wiki mashup
© Good et al; licensee BioMed Central Ltd. 2012
- Published: 24 April 2012
A variety of topic-focused wikis are used in the biomedical sciences to enable the mass-collaborative synthesis and distribution of diverse bodies of knowledge. To address complex problems such as defining the relationships between genes and disease, it is important to bring the knowledge from many different domains together. Here we show how advances in wiki technology and natural language processing can be used to automatically assemble ‘meta-wikis’ that present integrated views over the data collaboratively created in multiple source wikis.
We produced a semantic meta-wiki called the Gene Wiki+ that automatically mirrors and integrates data from the Gene Wiki and SNPedia. The Gene Wiki+, available at (http://genewikiplus.org/), captures 8,047 distinct gene-disease relationships. SNPedia accounts for 4,149 of the gene-disease pairs, the Gene Wiki provides 4,377 and only 479 appear independently in both sources. All of this content is available to query and browse and is provided as linked open data.
Wikis contain increasing amounts of diverse, biological information useful for elucidating the connections between genes and disease. The Gene Wiki+ shows how wiki technology can be used in concert with natural language processing to provide integrated views over diverse underlying data sources.
- Resource Description Framework
- SPARQL Query
- Disease Ontology
- Semantic Medium
- Human Disease Ontology
One of the key goals of current biomedical research is the elucidation of the relationships that hold between genes, environment and disease. Tackling this complex challenge requires the coordination of information emerging from a variety of scientific and medical communities. Increasingly, wiki technology is being used to enable such communities to collaboratively synthesize and distribute their knowledge. These ‘bio-wikis’ are emerging in many different areas [1, 2]. We have wikis about genes [3–5], proteins , protein structures [7, 8], SNPs , pathways , specific organisms  and many other biological entities. Bio-wikis have become important concept-centric knowledge resources but no single wiki contains all of the knowledge needed to answer most biological questions. The task of integrating the knowledge across different wikis remains the job of the end user. Recently, three key factors have emerged that make it possible to dynamically produce ‘meta-wikis’ that provide end users with consolidated views of information spanning multiple underlying wikis.
The first factor is the widespread adoption of the MediaWiki software for implementations within the bio-wiki community. MediaWiki installations now provide a powerful web API (Application Programming Interface) for direct, high-level access to the data contained in their databases. The API uses RESTful calls  to permit automated processes to make queries and post changes. Since many of the bio-wikis now have, by default, the same API, this implies that the same software can be used to query and edit the content of many different wikis without alteration.
The second factor is the increasing adoption of standardized systems for describing and recognizing biological concepts across multiple sites. Such systems provide identifiers for genes (e.g. NCBI Gene ids) and other biological concepts (e.g. Gene Ontology terms). These shared names can be used to identify when two different wikis contain knowledge pertinent to the same things and hence provide the key starting point for integration.
The third factor is the Semantic extension to the Media Wiki system . By installing this extension, wiki administrators make it possible to add semantic links between articles in the wiki - for example, GeneX hasSNP snpY. These semantic links can be used in queries to the system that are much like queries to a database.
The combination of a consistent API across many different wikis, a growing collection of unifying ontologies and the Semantic extension enables the rapid formation of wiki-mashups or ‘meta-wikis’. Such meta-wikis offer the potential to produce integrated views of the knowledge dispersed across many different sources. Here we show how an automatically generated meta-wiki, called the Gene Wiki+, composed of elements drawn from SNPedia and the Gene Wiki exposes substantially more evidence of links between genes and diseases than either resource contains independently.
SNPedia provides textual information about links between variations in human genes and human phenotypes . It uses standard identifiers from trusted authorities - primarily dbSNP  - to enable extensive linking to other public bioinfomatics databases and to personal genomics companies like 23andMe . It is not a comprehensive listing of SNPs, rather it focuses on SNPs that have some evidence of association with a human phenotype.
The Gene Wiki is an attempt to generate a collaboratively written, continuously updated review article for every human gene . It provides textual descriptions of gene function in normal conditions as well as descriptions of the role the gene may play in disease. Currently, it includes more than 10,450 Wikipedia articles about human genes .
The Human Disease Ontology was initiated in 2003 for the purpose of integrating biomedical data . It is an open source, hierarchical controlled vocabulary of human diseases, now included as one of the OBO Foundry ontologies . It is currently organized under 8 top-level categories: ‘disease by infectious agent’, ‘disease of anatomical entity’, ‘disease of cellular proliferation’, ‘disease of mental health’, ‘disease of metabolism’, ‘genetic disease’, ‘medical disorder’ and ‘syndrome’.
SNPedia + Gene Wiki + Disease Ontology = Gene Wiki+
The Gene Wiki+ is a meta-wiki composed of information about genes drawn from the Gene Wiki, information about SNPs from SNPedia and Disease Ontology terms identified in both resources using the NCBO Annotator . Bringing information about genes and diseases from both the Gene Wiki and SNPedia together into the Gene Wiki+ allows us to better address the following important question.
“Based on what we know now, what genes are linked to which diseases?”
To start, we installed the MediaWiki framework with the Semantic MediaWiki extension. We then set up a procedure to import all the Gene Wiki pages from Wikipedia and all SNP pages from SNPedia linked to a Gene Wiki gene using a program we created called SyncBot (Figure 1.1a, 1.1b). For the initial import, SyncBot passed each article’s content through the NCBO Annotator Web service to find occurrences of terms from the Human Disease Ontology. The bot then copied the page to the new wiki (Figure. 1.2), and appended the Disease Ontology terms detected in the Annotator to the end of each article (Figure. 1.3), forming semantic links between the gene or SNP and any associated diseases. In addition, the bot appended gene pages with the identifier of each SNP that occurred on the gene. This established the gene-to-disease and gene-SNP-disease links shown in Figure 1.4.
After the initial import, the bot continuously monitors the source wikis for changes to the original articles. When the source content is updated, the bot repeats the import and annotation process with only the changed pages, thus ensuring that the Gene Wiki+ stays up-to-date. (The code for the SyncBot is open source and can be accessed from the Gene Wiki+ software download page .)
Gene Wiki+ seamlessly integrates a multitude of knowledge sources. Both source wikis leverage and filter numerous large public datasets, such as dbSNP, HapMap, OMIM, PDB, PharmGKB and PubMed. Software bots from a distributed developer community continuously update both SNPedia and the Gene Wiki with structured content. SNPedia bots mine the scientific literature to populate new textual wiki content, whereas all textual content on the Gene Wiki is manually entered. All content in both wikis is curated and expanded by a human editor community. All of this information flows into the Gene Wiki+ through a continuous one-way syncing process providing users with up-to-date access to a diverse body of gene and disease related knowledge.
The 479 gene-disease pairs in the overlap contained 227 distinct diseases and 376 distinct genes. For example, the gene CYSLTR1 is linked to asthma in the text of the Gene Wiki: “The cysteinyl leukotrienes...] are important mediators of human bronchial asthma” and in the text of the SNPedia article on rs320995 (a SNP found in the CYSLTR1 gene): “subjects without T-allele in SNP rs320995 had 3.1 times higher risk of asthma” .
As Figure 2 clearly illustrates, both the Gene Wiki and SNPedia contain substantial amounts of knowledge pertinent to the challenge of finding associations between genes and diseases. The low level of overlap between the gene-disease associations found in these resources indicates the potential value of their combination.
Gene Wiki+ RDF
One of the key advantages of the Semantic Media Wiki framework is its ability to generate structured exports of the knowledge it contains that adhere to the Resource Description Framework (RDF) standard. This makes it possible to take advantage of the growing collection of tools built on this standard to conduct analysis of the data. Loading the RDF exported from Gene Wiki+ into a triplestore such as Jena, AllegroGraph, or OpenLink Virtuoso makes it possible to execute SPARQL queries, to integrate the data with other RDF resources, and to take advantage of Semantic Web reasoning systems. (SPARQL is the standard query language for RDF.)
As one example, the 479 gene-disease pairs in the overlap between the Gene Wiki and SNPedia can be identified with the following SPARQL query executed in a triplestore.
PREFIX wiki: <http://genewikiplus.org/wiki/Special:URIResolver/>
Inference in Gene Wiki+
This query produced 237 results - all of which were obtained through inference (i.e. no gene was annotated directly to the term ‘disease of mental health’).
The RDF export process translates MediaWiki categories into simple OWL (Web Ontology Language) classes without logical definitions and sub-category relationships into RDF-Schema subclass relations. As a result, the export contains a mirror of the class hierarchy from the Disease Ontology. This makes it possible to utilize basic reasoning over this hierarchy when processing the exported RDF.
The Gene Wiki+ as Linked Data
In addition to manipulating the data in Gene Wiki+ with RDF processing tools, the RDF export facilitates integration with other data sources in the Linked Data cloud. All of the articles from the Gene Wiki are linked via owl:sameAs relationships to the equivalent entities in DBPedia . As DBpedia forms one of the primary hubs of the Linked Data cloud , this connection ties the Gene Wiki+ meta wiki directly into the rest of the Semantic Web. Aside from DBpedia integration, all of the categories brought in from the Disease Ontology are marked with OWL:equivalentClass relationships to the corresponding classes in the Disease Ontology as indicated by their OBO URIs. For example, the Gene Wiki+ category for Vulvar keratoacanthoma-like carcinoma is marked as equivalent to the entity in the Disease Ontology identified by the URI http://purl.obolibrary.org/obo/DOID_7408. By supplying stable URIs, facilitating data access as RDF, and maintaining clear relationships to other Semantic Web resources, the Gene Wiki+ meta-wiki is ready for use in RDF-based applications that facilitate interaction with data drawn dynamically from multiple sources.
Enhancements to the user experience
The low amount of overlap between the gene-disease relationships found in the Gene Wiki and the gene-SNP-disease relationships from SNPedia was surprising. Based on a review of the meta-wiki assembly process and manual inspections of the data, we could not determine any particular systematic driver of the observed differences. It appears that the two data sources simply contain different data. As one example, the text of the Gene Wiki article on the Dopamine receptor D3 gene contains a reference to its relationship to Parkinson’s disease, backed by , that is not captured on any of its seven related SNPs. At the same time, the related SNP Rs167770 indicates a relationship to breast cancer, supported by , that is not represented in the text of the Gene Wiki article (Figure 3). While this is just an anecdote, it is indicative of the underlying trend in the data as presented in Figure 2.
There are many reasons why a particular gene might be associated with a disease in a Gene Wiki article that do not implicate a particular SNP. For example, genes may be involved in pathways known to be important to disease pathogenesis or to the body’s immune response while there may not be any known SNPs associating that gene with that disease. On the other hand, if a SNP linked to a particular gene is related to a disease, it seems likely that the gene-disease connection should be made in the text of the article about the gene. In the latter case, the gene-SNP-disease associations provide excellent starting points for potential expansions of Gene Wiki articles.
It is important to note that there is no well-accepted standard reference database for structuring and curating gene-disease associations. While several databases provide information relating genes to diseases e.g. PharmGKB, Online Mendelian Inheritance in Man (OMIM), and UniProt (see ‘involvement in disease’ attribute), the information in each one is different, each uses their own vocabulary and none has emerged as a unifying standard. In this work we seek not to replace such curated resources, but to provide a crowd-sourced complement to them.
One of the weaknesses of the approach used to build this meta-wiki is that it represents a one-way sync. If editors make changes to the articles in the meta-wiki, there is currently no automated mechanism for migrating those changes back to the articles in the original wikis. While one option is to let these meta-wikis evolve independently of their parents, a better approach might be to establish mechanisms through which edits made to a meta-wiki article could flow back into the articles used to create them. Such a mechanism would effectively extend the reach of the source wikis - both in terms of exposing their contents and of acquiring more editors. There are tools emerging that will make this possible. For example, the Distributed Semantic Media Wiki system is an extension that enables the creation of a network of Semantic Media Wiki servers that share common semantic wiki pages . With such a system in place, we might imagine that meta-wikis like the one discussed here could serve not only as new integrated resources for consuming information but also new points for users to contribute information back to the community collection.
We have demonstrated how a high-level linking of genes and diseases can be accomplished through the meta-wiki approach, but we have not touched on the deeper, more difficult question of how these genes are linked to these diseases. To address this complex challenge, the work of thousands of specialists needs to be assembled into integrated wholes that can be understood and used to drive action. The topic-focused wikis emerging in different areas of biology represent one step of this process of collaborative knowledge synthesis. Looking forward, meta-wikis such as the one presented here offer the potential to go one step further - to help unearth and present the latent relationships that exist between different concepts and different communities.
This work was supported by grants GM083924 and GM089820 from the National Institute of General Medical Sciences.
This article has been published as part of Journal of Biomedical Semantics Volume 3 Supplement 1, 2012: Proceedings of the Bio-ontologies Special Interest Group 2011. The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/3/S1
- Callaway E: No rest for the bio-wikis. Nature. 2010, 468: 359-360.View ArticleGoogle Scholar
- Waldrop M: Big data: Wikiomics. Nature. 2008, 455: 22-25. 10.1038/455022a.View ArticleGoogle Scholar
- Huss JW, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su AI: The Gene Wiki: community intelligence applied to human gene annotation. Nucleic Acids Res. 2010, 38: D633-639. 10.1093/nar/gkp760.View ArticleGoogle Scholar
- Huss JW, Orozco C, Goodale J, Wu C, Batalov S, Vickers TJ, Valafar F, Su AI: A gene wiki for community annotation of gene function. PLoS Biol. 2008, 6: e175-10.1371/journal.pbio.0060175.View ArticleGoogle Scholar
- Hoffmann R: A wiki for the life sciences where authorship matters. Nat Genet. 2008, 40: 1047-1051. 10.1038/ng.f.217.View ArticleGoogle Scholar
- Mons B, Ashburner M, Chichester C, van Mulligen E, Weeber M, den Dunnen J, van Ommen GJ, Musen M, Cockerill M, Hermjakob H: Calling on a million minds for community annotation in WikiProteins. Genome Biol. 2008, 9: R89-10.1186/gb-2008-9-5-r89.View ArticleGoogle Scholar
- Stehr H, Duarte JM, Lappe M, Bhak J, Bolser DM: PDBWiki: added value through community annotation of the Protein Data Bank. Database (Oxford). 2010, 2010: baq009-10.1093/database/baq009.View ArticleGoogle Scholar
- Weekes D, Krishna SS, Bakolitsa C, Wilson IA, Godzik A, Wooley J: TOPSAN: a collaborative annotation environment for structural genomics. BMC Bioinformatics. 2010, 11: 426-10.1186/1471-2105-11-426.View ArticleGoogle Scholar
- SNPedia. [http://www.snpedia.com]
- Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C: WikiPathways: pathway editing for the people. PLoS Biol. 2008, 6: e184-10.1371/journal.pbio.0060184.View ArticleGoogle Scholar
- Flórez LA, Roppel SF, Schmeisky AG, Lammers CR, Stülke J: A community-curated consensual annotation that is continuously updated: the Bacillus subtilis centred wiki SubtiWiki. Database. 2009, 2009: 9-View ArticleGoogle Scholar
- Fielding R: Architectural Styles and the Design of Network-based Software Architectures. Doctoral dissertation. 2000, University of California, IrvineGoogle Scholar
- Krotzsch M, Vrandevcic D, Volkel M, Haller H, Studer R: Semantic Wikipedia. Journal of Web Semantics. 2007, 5: 251-261. 10.1016/j.websem.2007.09.001.View ArticleGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011, 39: D38-51. 10.1093/nar/gkq1172.View ArticleGoogle Scholar
- 23andme. [http://www.23andme.com]
- Good BM, Clarke EL, de Alfaro L, Su AI: The Gene Wiki in 2011: community intelligence applied to human gene annotation. Nucleic Acids Research. 2011Google Scholar
- Osborne JD, Flatow J, Holko M, Lin SM, Kibbe WA, Zhu LJ, Danila MI, Feng G, Chisholm RL: Annotating the human genome with Disease Ontology. BMC Genomics. 2009, 10 (Suppl 1): S6-10.1186/1471-2164-10-S1-S6.View ArticleGoogle Scholar
- Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg L, Eilbeck K, Ireland A, Mungall C: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007, 25: 1251-1255. 10.1038/nbt1346.View ArticleGoogle Scholar
- Jonquet C, Shah NH, Musen MA, Ontology B: The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. 2009, 56-60.Google Scholar
- GeneWiki+. [http://genewikiplus.org/]
- DBpedia. [http://dbpedia.org]
- Bizer C, Lehmann J, Kobilarov G, Auer Sor, Becker C, Cyganiak R, Hellmann S: DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web. 2009, 7: 154-165. 10.1016/j.websem.2009.07.002.View ArticleGoogle Scholar
- Hong X, Zhou H, Tsai HJ, Wang X, Liu X, Wang B, Xu X: Cysteinyl leukotriene receptor 1 gene variation and risk of asthma. Eur Respir J. 2009, 33: 42-48. 10.1183/09031936.00057708.View ArticleGoogle Scholar
- Joyce JN, Millan MJ: Dopamine D3 receptor agonists for protection and repair in Parkinson's disease. Curr Opin Pharmacol. 2007, 7: 100-105. 10.1016/j.coph.2006.11.004.View ArticleGoogle Scholar
- Sangrajrang S, Sato Y, Sakamoto H, Ohnami S, Khuhaprema T, Yoshida T: Genetic polymorphisms in folate and alcohol metabolism and breast cancer risk: a case-control study in Thai women. Breast Cancer Res Treat. 2010, 123: 885-893. 10.1007/s10549-010-0804-4.View ArticleGoogle Scholar
- Skaf-Molli H, Canals G, Molli P: DSMW: Distributed Semantic MediaWiki. Proceedings of ESWC 2010. 2010, 2: 426-430.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.