Semantically enabling a genome-wide association study database
© Beck et al.; licensee BioMed Central Ltd. 2012
Received: 3 April 2012
Accepted: 22 August 2012
Published: 17 December 2012
Skip to main content
© Beck et al.; licensee BioMed Central Ltd. 2012
Received: 3 April 2012
Accepted: 22 August 2012
Published: 17 December 2012
The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central – a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.
A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.
We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
The database resource GWAS Central http://www.gwascentral.org (established in 2007, then named HGVbaseG2P ) is a comprehensive central collection of genetic association data with a focus on advanced tools to integrate, search and compare summary-level data sets. GWAS Central is a core component of the GEN2PHEN project http://www.gen2phen.org, which aims to unify human and model organism genetic variation databases. The modular architecture of GWAS Central allows the infrastructure to be extended for use with different types of data, and it is anticipated that through future support from the BioSHaRE project http://www.bioshare.eu, GWAS Central will be extended to integrate exome and next-generation sequencing data.
Currently, GWAS Central collates data from a range of sources, including the published literature, collaborating databases such as the NHGRI GWAS Catalog , and direct submissions from collaborating investigators. A given study represented in GWAS Central may investigate the genetic association to a single phenotype, or a range of phenotypes, associated with a disease of interest. In the case of multiple phenotypes, “sub-studies” will be reported as separate experiments. For example, a single GWAS may identify common genetic variation altering the risk to type 2 diabetes susceptibility, and so report the results from single or multiple experiments investigating related traits such as fasting plasma glucose levels, insulin sensitivity index, insulin response or findings from a glucose tolerance test. GWAS Central captures this distinction and reports the individual phenotype tested as well as the disease of interest.
GWAS Central currently holds 1664 reported phenotypes (Figure 1; blue line). Identical phenotypes may be described differently between studies due to inconsistencies associated with variations in terminology use and in editorial style of authors when describing the phenotypes. A pragmatic solution was required to allow harmonisation of the GWAS phenotype descriptions to facilitate consistent querying within GWAS Central, and to ensure that the phenotype data can be accessed and understood using a semantic standard to allow data integration.
The benefits of ontologies in resolving ambiguity associated with divergent and “free-text” nomenclature are well documented . The issues surrounding the reusability of phenotype descriptions within GWAS Central are typical of problems tackled by groups working on the controlled vocabulary of other model organisms, for example yeast , worm  and mouse . In these cases, either new phenotype ontologies were built or existing ontologies were applied within a meaningful annotation framework.
The Open Biological and Biomedical Ontologies (OBO) Foundry is an initiative involving the developers of life-science ontologies and is tasked with setting principles for ontology development. OBO’s goal is to coordinate development of a collection of orthogonal interoperable biomedical ontologies to support data integration . The application of two OBO Foundry principles in particular suggest that the development of a new ontology to capture human phenotype data derived from GWAS would not be in the community’s best interest. These principles assert that new ontologies must be, firstly, orthogonal to other ontologies already lodged within OBO, and secondly, contain a plurality of mutually independent users .
One candidate OBO Foundry ontology in name alone - the Human Phenotype Ontology (HPO)  - indicates immediate overlap with our domain of interest (GWAS phenotypes). Further human phenotype-related ontologies are also available from the National Center for Biomedical Ontology (NCBO) BioPortal , for example Medical Subject Headings (MeSH)  and the International Classification of Diseases (ICD) . Despite OBO Foundry efforts in promoting the creation of orthogonal ontologies, there is still a high rate of term reuse, with a recent study reporting 96% of Foundry candidate ontologies using terms from other ontologies . The prevalence of term reuse and redundancy between ontologies leaves potential users asking the obvious question “which ontology do I use?”.
The ambiguity in arriving at an obvious candidate ontology can have a devastating effect on system interoperability and data interchange. We believe the development of a dedicated GWAS phenotype ontology would compound that problem. Additionally, since 2007 when HGVbaseG2P was established, there has been no call for a dedicated GWAS phenotype ontology from other quarters, so also failing the “plurality of users” principle. Consideration of these factors led us to favour an approach that involves the application of existing ontologies within the GWAS Central data model.
In the context of the genetic analysis of human disease, and thus GWAS, the term ‘phenotype’ is used to define an aggregated set of medically and semantically distinct concepts. Traits and phenotypes are often considered synonymous, however they are distinct domains within Ontology. A trait is a heritable, measurable or identifiable characteristic of an organism such as systolic blood pressure. Phenotype is a scalar trait , essentially a trait with a value, such as increased systolic blood pressure. GWAS typically report findings in relation to traits, for example “Genome-wide association study identifies eight loci associated with blood pressure” . Furthermore, human disease is a complex collection of phenotypic observations and pathological processes . The diagnosis of a disease depends upon identifying a set of phenotypes, which can be either medical signs or symptoms. A medical sign is an objective indication of a medical characteristic that can be detected by a healthcare professional such as blood pressure. A symptom is a subjective observation of the patient that their feeling or function has departed from the ‘normal’ such as experiencing pain. GWAS report genetic associations to diseases, for example, “Candidate single-nucleotide polymorphisms from a genomewide association study of Alzheimer disease” , and also medical signs and symptoms such as “Genome-wide association study of acute post-surgical pain in humans” .
During the course of this study, which sets out to implement a strategy for logically describing and distributing GWAS observations contained within the GWAS Central database resource to support GWAS data comparison, we examine these differing granularities of phenotypes (or traits). Nonetheless, in order to aid readability throughout this manuscript we use the term ‘phenotype’, unless otherwise stated, with the same all-encompassing meaning assumed by the biologist: namely, the observable characteristics resulting from the expression of genes and the influence of environmental factors.
A striking advantage of binding human GWAS phenotypes to an ontology is the ability to extend automatic cross-species analyses of phenotype and genotype information with comparative, suitably annotated, datasets. The laboratory mouse is a central model organism for the analysis of mammalian development, physiological and disease processes . It is therefore understandable that the mouse has been suggested as an ideal model for the functional validation of GWAS results .
A range of resources are available for the querying of mouse genotype-phenotype associations, such as: the Mouse Genome Database (MGD) which contains data loaded from other databases, from direct submissions, and from the published literature ; EuroPhenome, a repository for high-throughput mouse phenotyping data ; advanced semantics infrastructure involving development of a species-neutral anatomy ontology ; and finally a unified specification for representing phenotypes across species as entities and qualities (EQ)  that has been proposed to enable the linking of mouse phenotypes to human diseases and phenotypes for comparative genome-phenome analysis .
A major bottleneck in implementing high-throughput phenomic comparisons leveraging the above resources is the absence of a well annotated, controlled and accessible human disease genotype-phenotype dataset, and the necessary tools to access it.
The Semantic Web builds upon the Resource Description Framework (RDF) and related standards to give meaning to unstructured documents on the web to allow data to be understood, shared and reused. The term “Linked Data” is commonly used to refer to a specific approach to connecting data, information and knowledge on the Semantic Web that was not previously linked . These technologies and approaches have in recent years been slowly but surely infiltrating the life sciences domain to tackle diverse problems. A notable recent development is the Semantic Automated Discovery and Integration framework (SADI) , a set of conventions for using Semantic Web standards to automate the construction of analytical workflows.
In the field of disease genetics, applications of Semantic Web technologies range from publishing information held in curated locus-specific databases as Linked Data , to text-mining the published scientific literature for mutations found to affect protein structure and subsequently making methods and data accessible via the SADI framework [31, 32]. To our knowledge, this has not yet been done with GWAS data in a comprehensive fashion. In relation to the Linked Data approach specifically, enhancement of GWAS datasets (such as those made available via GWAS Central) with phenotype annotations published in Semantic Web compatible formats has the potential to facilitate integration with other, related, Linked Data resources, such as genes, proteins, diseases and publications [33, 34].
The complexity of GWAS data sets and associated metadata led us to adopt so-called “nanopublications” ; a recently developed framework for publishing one or more scientific assertions as Linked Data, wrapped into self-contained “bundles” which also contain the contextual information necessary for the interpretation of the assertion, as well as provenance, attribution and other key metadata. The nanopublishing approach has already been used to publish locus-specific data  and other biological datasets . Ultimately, by making a comprehensive GWAS dataset available as nanopublications we aim to provide a rich addition to the web of Linked Data, while also allowing researchers who contribute to primary GWAS publications to be properly attributed. This latter feature of nanopublications is a compelling reason for their use, particularly with the recent drive towards publishing data and metadata and creating incentives for researchers to share their data .
Several ontologies available from the NCBO BioPortal could be used to annotate part or all of the phenotypes described by GWAS. Some of the most relevant ones are either members of the Unified Medical Language System (UMLS) BioPortal grouping (for example, MeSH, ICD10 and SNOMED CT ) or categorised by BioPortal as being related to ‘Phenotype’ (for example, HPO). We attempted to objectively identify which ontology would be most suitable for the purpose of defining GWAS phenotypes.
To this end, we defined ontology suitability as the ability to capture the maximum number of phenotypes at the level of granularity at which they are described. Our ambition to find a single ontology capable of describing the broad spectrum of GWAS phenotypes was pragmatically driven by a requirement to have a single ontology to query the entire database against. If we were to query against the complete ontology graph we would require all phenotypes to be returned. Therefore, during this comparative study we would consider an ontology more suitable if it could describe (either by concept or by synonym) the condition “Fuchs endothelial dystrophy” compared to the more general “corneal disease” or, more generally still, the term “eye disease”.
Results from the automatic mapping of GWAS phenotypes to relevant human-related vocabularies in BioPortal
Number of mappings
Coverage of GWAS Central phenotypes
Number of unique mappings
Unique coverage of the total automatic mappings made
Human Disease Ontology (DO)
Human Phenotype Ontology (HPO)
International Classification of Diseases (ICD10)
Medical Subject Headings (MeSH)
Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT)
The decision to adopt MeSH as the “backbone” for GWAS phenotype annotations in GWAS Central was taken due to MeSH being more familiar to biologists compared to the clinically focussed SNOMED CT. MeSH is used by the U.S. National Library of Medicine’s MEDLINE database to index abstracts and is searchable in PubMed . By contrast, there are relatively few research-related implementations of SNOMED CT. Additionally, SNOMED CT is more difficult to navigate and manage compared to MeSH, with SNOMED CT containing just under 400,000 classes compared to just under 230,000 in MeSH (figures taken from BioPortal).
In addition, we assessed the novel mappings achieved by each vocabulary (Table 1). Novel mappings occurred when a free-text phenotype description mapped to a term in a single ontology. During the exact mapping process, MeSH uniquely contributed 15.4% of the total 332 exactly mapped terms, followed by SNOMED CT (9.9%) and HPO (4.8%). However, during the partial mapping SNOMED CT uniquely contributed 12.2% of the total 434 partially mapped terms, followed by HPO (6.9%) and MeSH (6.7%). Inspection of the mapping results showed that by switching from exact mapping to partial mapping, a free-text phenotype description such as “forced expiratory volume” that had previously uniquely mapped to the MeSH Descriptor “Forced Expiratory Volume”, could now map to a SNOMED CT term “Normal forced expiratory volume”. Similarly, the free-text phenotype description “ventricular conduction” that could not map to any of the terminologies during the exact mapping could uniquely map to the SNOMED CT term “Ventricular conduction pattern” during the partial mapping. Since HPO made the second highest unique contribution in the partial mappings we assessed the benefits HPO could make in the annotation of GWAS phenotypes.
The HPO is an ontology of phenotypic abnormalities that was developed in order to provide a standardised basis for computational analysis of human disease manifestations . The results from our ontology suitability analysis indicated that HPO would facilitate unique mapping of 30 GWAS phenotype descriptions during the partial mapping process. Manual inspection of these terms showed they were terms describing medical signs and symptoms, rather than disease names that have high coverage in the other ontologies investigated. For example, HPO can uniquely describe “Coronary artery calcification” (term identifier HP:0001717) rather than the disease for which this can be a clinical manifestation such as in “Gaucher Disease” (MeSH Descriptor identifier D005776).
The performance of HPO in mapping to GWAS traits increased from 7% for exact mappings to 13.4% for partial mappings (Table 1). Since HPO is an ontology of phenotypic abnormalities it contains many terms where the string “Abnormal” or similar precedes the trait. During the partial mapping, traits such as “number of teeth” mapped to partially related HPO terms such as “Abnormal number of teeth”, hence the improved performance of HPO in making unique term contributions during the partial mappings.
Not every medical sign and symptom in the GWAS Central phenotype description list could be mapped to HPO, due to either lack of an appropriate term or lack of a synonym. However, the HPO group seeks community engagement and there is a protocol in place for users to submit required terms for inclusion via the HPO term tracker . Regular updates of the central ontology file ensure the changes are disseminated in a timely manner. In addition, subsets of terms from HPO are undergoing deconstruction into EQ descriptions , thus facilitating the use of HPO in cross-species comparisons. These factors made HPO a candidate for the annotation of individual phenotypic abnormalities (medical signs and symptoms) within GWAS Central.
The relatively low coverage overall achieved through automatic term mapping suggests that human decision-making is required during the process of phenotype curation, in order to ensure the biological meaning is preserved during the selection of alternative but appropriate, lexically distinct, concepts.
MeSH is structured into a hierarchy of Descriptors (or Headings) under which Terms that are strictly synonymous with each other are grouped in a Concept category. The Descriptor/Concept/Term structure is adopted within GWAS Central. Each GWAS reported in GWAS Central undergoes a phenotype annotation process (see Methods). During the annotation process the original full-text published report of the GWAS is accessed via PubMed (or via communications with collaborating groups e.g. pre-publication reports) and all phenotypes for each experiment are manually curated with a MeSH Descriptor by a small team of postdoctoral experts to ensure a high level of quality and consistency.
Where possible, a Descriptor is assigned which is described by a Term that matches the phenotype under consideration exactly. Where an exact match cannot be found then the closest match is sought, usually by selecting the parent Descriptor in the hierarchy, from where the curator would expect the exact Descriptor to exist. For example, the phenotype “sporadic amyotrophic lateral sclerosis” would be annotated with the MeSH Descriptor “Amyotrophic Lateral Sclerosis”. If a published report has been indexed for MEDLINE, this indicates that subject analysts at the United States National Library of Medicine have examined the article and assigned the most specific MeSH terms applicable to the article . In these cases the GWAS Central curators will consider any phenotype-related MEDLINE MeSH Descriptors for use alongside any additional appropriate MeSH Descriptors.
Phenotypes in GWAS Central are annotated at the level of individual experiments. This is in contrast to the MEDLINE MeSH annotations made at the level of the whole publication, which identify phenotypes that are mentioned somewhere in the journal article. GWAS Central curators are required to ensure that the correct phenotypes are associated with the correct experiments, which in turn are associated with the correct analysis methods, analysis and sample panels, and genetic marker datasets as defined by the GWAS Central data model (definitions of these concepts are available from the GWAS Central glossary: http://www.gwascentral.org/info/reference/definitions-and-glossary).
MEDLINE indexing is not available for all articles at the time of inclusion in GWAS Central. Citations supplied by publishers are not indexed and are identified by the citation status tag [PubMed – as supplied by publisher], for example, the GWAS reported in the article by Paus et al. (2011) with a PubMed ID of 22156575 http://www.ncbi.nlm.nih.gov/pubmed/22156575. There can also be a delay from a GWAS report being made available in PubMed to it being indexed for MEDLINE, during which time the citation is assigned the status tag [PubMed – in progress]. Since GWAS Central is frequently updated to ensure it contains the very latest studies, it is usual for the most recent reports not to contain MEDLINE MeSH annotations at the time of import.
The GWAS Central interface allows phenotypes to be retrieved via browsing the hierarchy of Descriptors (only Descriptors which are used in annotations are rendered) or by searching for Terms using an auto-suggest text field.
The HPO defines the individual phenotypic abnormalities associated with a disease, rather than the disease itself. Therefore, when a disease name, such as “Creutzfeldt-Jakob Syndrome”, is used to describe a GWAS phenotype then a single HPO term representing the disease will not exist. Instead, HPO can be used to define the medical signs and symptoms associated with the disease. The HPO was originally constructed using data from the Online Mendelian Inheritance in Man (OMIM) database , and now provides comprehensive annotations of clinical phenotypes for OMIM diseases . These HPO-to-OMIM mappings are implemented alongside OMIM-to-MeSH term mappings in GWAS Central to provide automatically inferred clinical manifestations described by HPO for the originally assigned disease annotation described by MeSH. These phenotypes are “inferred” since they may or may not be present, or present in differing severities, in the GWAS participants contributing to a study. While all participants for a study share the characteristic of having been diagnosed with the disease, it is not possible to determine from the GWAS report which medical signs or symptoms contributed to the diagnosis. The inferred HPO phenotypes indicate which clinical manifestations could have contributed to the diagnosis.
The Mammalian Phenotype Ontology (MPO)  is used for classifying and organising phenotypic information related to the mouse and other mammalian species. MPO is the de facto standard for annotating mouse phenotypes in online resources. As a first step towards high-throughput phenotype comparisons between human and mouse, we have developed an analysis pipeline for the automatic retrieval of human and mouse ontology-annotated phenotype data for gene orthologs. A public version of this pipeline is available from the scientific workflow exchange community website myExperiment .
Starting from a list of human gene symbols, the mouse gene orthologs are determined.
GWAS Central is then queried for phenotypes associated with genes on the list for a given p-value threshold, and the corresponding MeSH annotation(s) retrieved. Each p-value represents the probability of obtaining the observed association between a genetic marker and a phenotype for the dataset, assuming the null hypothesis is true.
Next, the MGD is queried for MPO annotation(s) for the mouse ortholog genes.
Finally, EuroPhenome is queried for MPO annotation(s) made to the mouse orthologs for a given statistical significance limit.
The resulting lists present the ontology annotations made for the gene ortholog dataset and can be used for cross-species comparisons.
The following use case presents an example of the input and output of the pipeline:
The human BAZ1B gene is known to be deleted in the development disorder Williams syndrome . A researcher working on BAZ1B wishes to learn which phenotypes have been associated with the gene as a result of GWAS, and also which phenotypes have been associated with the mouse ortholog Baz1b gene. The researcher downloads the comparative pipeline from myExperiment and loads it into the Taverna workbench  installed on their PC.
Output from running the human-mouse phenotype comparison pipeline
Decreased body weight
Decreased circulating cholesterol level
Abnormal myocardial trabeculae morphology
Decreased circulating HDL cholesterol level
Abnormal myocardium layer morphology
Decreased body size
Decreased litter size
Decreased fetal size
Small parietal bone
Short nasal bone
Abnormal palatine bone morphology
Abnormal heart morphology
Double outlet right ventricle
Dilated heart left ventricle
Dilated heart right ventricle
Abnormal fourth branchial arch morphology
Decreased birth body size
Atrial septal defect
Muscular ventricular septal defect
Increased heart right ventricle size
Increased heart left ventricle size
Abnormal double-strand DNA break repair
Complete postnatal lethality
Follow-up searches of the primary data held in the respective databases are carried out to understand the annotations. GWAS Central shows a genetic marker in the BAZ1B gene (SNP rs1178979) with a high probability (p-value 2e-12) of being associated with genetically determining triglycerides, as determined during a GWAS involving white European and Indian Asian participants (see http://www.gwascentral.org/study/HGVST626). EuroPhenome shows that during the “Clinical Chemistry” procedure of a high-throughput phenotyping pipeline , the male Baz1b heterozygous knockout mouse line was detected as having decreased circulating cholesterol (p-value 7.76e-7) and HDL cholesterol (p-value 8.20e-6) levels compared to the background mouse strains. Taken together, these findings tentatively suggest a role for BAZ1B and its ortholog as a genetic determinant of circulating lipids in the human and mouse. The MGD annotations do not include a “lipid-type” phenotype, which may imply that this genotype-phenotype association has not been reported in the literature for the mouse.
Based on the reported association of the BAZ1B gene with the circulating lipid phenotype, and knowing that the Baz1b knockout mouse line is available (since annotations were obtained from EuroPhenome), the researcher could now prioritise further investigation of the BAZ1B gene and its orthologs.
It is important to note that since nanopublications are entirely RDF based and intended for consumption by machines, by themselves they are not human-readable. For user-friendly tools to query and visualise the information contained within GWAS Central, researchers are advised to use the main GWAS Central website [http://www.gwascentral.org].
Further information on using the Semantic Web resources available through GWAS Central is available from the website help pages [http://www.gwascentral.org/info/web-services/semantic-web-resources].
We adopted the use of MeSH to define GWAS phenotypes to meet the overriding requirement of being able to capture and organise all data within a single ontology for querying and comparison within GWAS Central. While SNOMED CT scored slightly higher in our automatic annotation analysis compared to MeSH, there are doubts over the suitability of SNOMED CT for use by biomedical researchers. SNOMED CT is a clinical terminology, and has been adopted by the NHS for use as a coding standard. However, concerns have been raised regarding its complexity having a detrimental impact on finding data coded to it . MeSH is more intuitive to biomedical researchers and has been shown to be capable of annotating all GWAS phenotypes at an informative level of granularity, albeit at a coarser granularity than originally described in some cases.
In order to assist our phenotype annotation process we have investigated the use of text-mining and mark-up tools to automate the extraction of relevant phenotype ontology terms from the GWAS literature. We focussed on the annotation of GWAS phenotypes with MeSH, since MeSH forms the “backbone” of GWAS Central annotations. A range of tools are available for the automatic annotation of free-text with MeSH Terms (see  for a review of four distinct methods for classifying text with MeSH). We investigated two tools that are well documented and are currently supported: the NCBO Annotator  and MetaMap . Both tools were used to annotate a subset of ten full-text GWAS articles with MeSH Terms. Curators also assessed the same subset and assigned MeSH Terms manually following the GWAS Central phenotype annotation process (see Methods).
While a detailed analysis of how the automated tools performed is out of the scope of this article, there was one commonality. Both tools could assign MeSH Terms (including phenotype-relevant terms) to GWAS studies as a whole, however during the manual annotation process MeSH Terms could be assigned to individual GWAS experiments in keeping with the GWAS Central data model. Currently, GWAS Central represents studies that are described in 147 different journal titles, with varying editorial styles. GWAS metadata is complex and understanding the associations between participant panels, methods, observations and genetic marker datasets, as required by the data model, can be challenging for expert curators.
For these reasons, we conclude that there is currently little benefit in incorporating automatic text-annotation using the tools we have evaluated. Nonetheless, we are encouraged to further investigate the possibility of building on the principles of these tools and to develop an advanced text-mining and annotation strategy for future use in GWAS Central.
In the intervening years since the inception of HGVbaseG2P, and subsequently GWAS Central, complementary GWAS databases have embraced the benefits of using controlled vocabularies for the description of phenotypes. Two GWAS databases that currently make use of controlled vocabularies are the DistiLD database  and GWASdb .
The DistiLD database (reported in 2011) maps GWAS SNPs to linkage disequilibrium blocks and diseases where ICD10 is used to define the diseases. ICD10 is an ideal vocabulary for the description of disease phenotypes, but, as expected, resolution is lost when querying the dataset for non-disease traits. For example, a search for “blood pressure” on the main search page [http://distild.jensenlab.org] simply returns results from free-text searches of the publication titles and abstracts.
GWASdb (reported in 2011) allows exploration of genetic variants and their functional inferences, incorporating data from other databases including GWAS Central. Seventy percent of phenotypes in GWASdb are mapped to DOLite and the remainder are mapped to HPO . This prevents the use of a single ontology to query against the complete dataset. It is also unclear from the interface as to the level of granularity of the annotations, with only the first four levels of HPO accessible from the browser. By contrast, GWAS Central annotates up to level nine of HPO and it is therefore difficult to assess whether GWAS Central and GWASdb annotations agree for a given study.
A wider question remains as to the reproducibility of phenotype annotations between databases and the interchange of data bound to different standards. We have initiated coordination between complementary GWAS databases to ensure a unified set of annotations exist, mapped to all relevant semantic standards in use in the community (see the “GWAS PhenoMap” project at http://www.gwascentral.org/gwasphenomap/).
Our human-mouse phenotype comparison pipeline facilitates immediate retrieval of ontology-bound phenotype data for orthologous genes. Orthologous genes that do not share a phenotype could be novel candidates for the phenotype and thus could benefit from undergoing further study.
Phenotypes can be logically defined using ontologies by making an equivalence between terms in a pre-composed ontology (e.g. MeSH, HPO and MPO) and entity and quality (EQ) decompositions . For example, the MPO term “supernumerary teeth” is represented in EQ as “E: tooth + Q: having extra physical parts” (taken from the OBO Foundry mammalian phenotype logical definitions).
Comparison of the phenotypes generated from our pipeline is currently a manual process, but this could be optimised through using the EQ logical definitions of the pre-composed ontology terms. This would provide computer-interpretable definitions that could support reasoning to suggest, for example, that the MPO term “supernumerary teeth” and the HPO term “Increased number of teeth”, represented by the same logical definition (using a species-neutral anatomy ontology), are equivalent.
Encouragingly, work has begun on decomposing HPO musculoskeletal related terms into EQ definitions for the purpose of cross-species comparisons . As the EQ definition layer is progressed by domain experts into other categories of phenotypes covered by HPO, the possibility of making GWAS phenotypes available as EQ statements advances closer.
In an alternative approach, the PhenoHM human-mouse phenotype comparison server accepts phenotypes as input, rather than genes, and implements direct mappings from human (HPO) to mouse (MPO) ontologies  to identify human and mouse genes with conserved phenotypes. By comparison, our pipeline provides the flexibility to allow phenotypes from any ontology to be manually compared (from any database providing the relevant web services) and in theory the PhenoHM mappings could be extended to include MeSH and other ontologies. However, evaluation is required of the benefits of producing relatively quick ad hoc mappings between terminologies compared to a more time-consuming logical definition process that could facilitate more extensive cross-ontology comparisons.
Whichever method is employed, it will make reversing the pipeline an attractive possibility. Lists of orthologous phenotypes could serve as input for querying against human and mouse resources to retrieve associated genes, in order to answer questions such as “which gene is responsible for this phenotype in the mouse?”. In the immediate term we anticipate that the rich, high-quality GWAS phenotype annotations in GWAS Central will enhance the results of current and future cross-species comparisons involving the human.
By making genotype-phenotype associations available in a Linked Data-friendly form , GWAS Central has taken the first steps towards interoperability on the Semantic Web. Our prototype nanopublications were designed to link with and mesh into the broader web of Linked Data, by way of shared URI identifiers and ontologies for identifying and describing key entities in our domain of interest. This first-generation collection of GWAS nanopublications, though limited in scope and features, holds great potential for enriching the expanding network of semantically-enabled online information resources in the biomedical sphere.
It is important to emphasise that GWAS Central nanopublications are simply items of data, not statements of knowledge. For example, a p-value for a marker in a GWAS represents a statistical test of association that was factually observed in an experiment. This p-value is clearly not equivalent to a validated biological causal relationship between a genetic variant and a disease. There is some risk that eventual users of the data may confuse the two, especially given that GWAS nanopublications will be distributed widely and consumed outside of the “parent” GWAS Central resource itself. This is not a reason to avoid nanopublishing as such, but it does underline the importance of including appropriate metadata describing context and provenance along with, and clearly linked to, the core assertions.
As new tools are developed to reduce the technical knowledge required to semantically enable resources (e.g. the D2RQ Platform  and Triplify ) and leave bioinformaticians with the job of simply organising their data, it seems obvious that increasing numbers of biomedical resources will become semantically enabled in the near future. As and when this happens, we intend to further expand the set of Linked Data resources that our GWAS nanopublications link out to, thereby increasing their utility when consumed by other semantic tools. We are also planning to further expand the semantic capabilities of GWAS Central by exposing the association nanopublications, the SPARQL endpoint and the phenotype comparison pipeline (and future workflows we may develop) via the SADI framework.
We have made available high-quality phenotype annotations within a comprehensive GWAS database. We have considered the spectrum of phenotypes reported by published GWAS, ranging from diseases and syndromes to individual medical signs and symptoms, and adopted a suitable annotation framework to capture phenotypes at the finest level of granularity. All GWAS phenotypes are bound to a MeSH Descriptor to ensure the pragmatic necessity that a single ontology can be queried to retrieve all phenotype data. The HPO provides single phenotypic abnormality annotations either directly, mapped from MeSH, or inferred via deconstructions of disease phenotypes. A human-mouse phenotype comparative pipeline provides a valuable tool for comparison of human and mouse phenotypes for orthologous genes.
By providing GWAS Central data in the form of nanopublications and integrating this data into the Linked Data web, we present a platform from which interesting and serendipitous findings related to genotypes, phenotypes, and potentially other types of Linked Data, can be made.
In a manual step all descriptions were assessed to determine if they related to a trait or phenotype. To ensure consistency in the descriptions, and since the majority of descriptions related to traits, phenotypes were transformed to traits. This involved the removal of values assigned to traits e.g. “Hair color: black versus red” was transformed to the trait “Hair color”.
Since the ontologies under investigation express concepts in the singular form, we ran a script to remove plurals from the trait list.
British and American spellings are not synonymous in all ontologies, for example the HPO term “Abnormality of the esophagus” (HP:0002031) does not have the synonym “Abnormality of the oesophagus”. Therefore, British and American spelling differences were neutralised by providing both spellings for a word. A script split each trait description (term) into component strings (words) and queried the words against a list of words with spelling variants (source: http://en.wikipedia.org/wiki/Wikipedia:List_of_spelling_variants). Where a word was found to have a spelling variant a new term was created containing the word with the alternative spelling. The new term was appended, tab-separated, to the original term in the trait list.
The BioPortal REST web services allow for programmatic querying and comparison of the ontologies contained within BioPortal. In order to access the web services users are required to login to BioPortal to obtain an API key. The ‘Search’ web service queries a user-specified term against the latest versions of all BioPortal ontologies, thus eliminating the need to parse the latest version of an ontology in its native file format (e.g. OWL, OBO, UMLS format or custom XML). The ‘Search’ web service ignores capitalisation of both the user-specified term and the ontology terms. By default, the search attempts to find both partial and exact matches. During a partial search for a single word the wildcard character (*) is automatically appended to the end of the word, and for multiple-word searches the wildcard character is appended to the end of each word . The next stage of our analysis involved running a script to query each trait description against all BioPortal ontologies using the ‘Search’ web service. The web service was run twice for each term, with alternating ‘exact match’ arguments – this argument forces an exact match. During both runs for each trait description, the input was the normalised term, for example “Hair color”. The web service output was queried for matches in the ontologies of interest, namely DO, HPO, ICD10, MeSH and SNOMED CT. If a spelling variant did not return a match in at least one of the ontologies of interest, then the spelling alternative was also queried, for example “Hair colour”. The query term and the mapped ontology term were written to an output file. The total numbers of trait descriptions that map exactly and partially to the ontologies under investigation were recorded (Table 1). When a trait was mapped to a single term in only one of the ontologies (a unique mapping), the query term, the mapped ontology term and the ontology name were written to a second output file. The number of unique mappings for each ontology during the exact and partial searches was recorded (Table 1).
The initial ontology association between a phenotype and a genetic marker dataset is made during a manual curation process with the subsequent mappings made automatically. We use the MOLGENIS database management platform  as the basis for a curation tool. The GWAS Central data model can be viewed and edited through a series of connected forms (Figure 4). For each GWAS represented in GWAS Central a curator obtains the full-text report for the study and adds a new “sub-study” for each experiment. As the information is obtained from reading the report, the metadata for each experiment is entered into the curation tool to satisfy the GWAS Central data model, resulting in an experiment that is associated with sample panels, phenotype methods, analysis methods and a genetic marker dataset (see the GWAS Central glossary: http://www.gwascentral.org/info/reference/definitions-and-glossary). Each phenotype method contains a phenotype property that requires a phenotype annotation. The relevant MeSH Descriptor identifier is entered into the form. If a curator deems the annotation not to be an exact match, and instead the annotation is made using the closest available term, then this is flagged in the database. In these cases an appropriate HPO term will be manually sought.
MeSH is automatically mapped to HPO via UMLS. The cross-referenced UMLS concept unique identifier for a HPO term is obtained either from the source HPO OBO file http://compbio.charite.de/svn/hpo/trunk/src/ontology/human-phenotype-ontology.obo or via MetaMap , which maps free-text to the UMLS Metathesaurus. The MeSH identifier is then obtained from the cross-referenced UMLS entry. The HPO-to-OMIM mappings are automatically extracted from the mapping file downloaded from the HPO group’s website http://compbio.charite.de/svn/hpo/trunk/src/annotation/. The OMIM-to-MeSH mappings are manually assigned.
The human-mouse phenotype comparison pipeline uses the web services made available by the contributing data sources to ensure the latest data is accessed. A number of web services were used to return mouse ortholog genes for a list of human gene symbols and then return the corresponding annotated phenotypes for both sets. The Entrez Programming Utilities (E-Utilities) ESearch service  is used to validate the given list and retrieve Entrez IDs for the genes. The gene symbols for the mouse orthologs are retrieved from the MGI BioMart . The MGI and EuroPhenome BioMarts are accessed to retrieve the MPO terms annotated to the mouse ortholog gene list. The GWAS Central REST web service is accessed to retrieve the phenotype annotations for the human gene list. The public version of the pipeline was created using the workflow management system Taverna . Taverna offers users the ability to visualise and reuse web services within workflows via the Taverna workbench, which is an intuitive desktop client application. Taverna is also integrated with myExperiment, so facilitating the distribution of the pipeline and its reuse by the community in whole or in part.
To provide semantically enabled GWAS Central resources and integrate them into the Linked Data web, Perl modules originally created to search markers, phenotypes, association results and nanopublications in GWAS Central were extended to provide output in RDF, Turtle and in the case of nanopublications, N-Quads format. When navigating resources, the format to be returned to client applications is determined either through HTTP header content-type negotiation (application/rdf + xml, text/turtle or text/x-nquads), or through the use of a 'format' parameter (rdfxml, turtle or nquads) in the URI.
A Perl script utilising the above-mentioned search modules extracted all appropriate resources from GWAS Central as RDF, which were subsequently loaded into an RDF triple-store created using the Apache Jena TDB component . Jena was selected due to its support for the named graph extension that is an essential requirement for representing individual sections within nanopublications. The SPARQL end-point was set up using the Fuseki server .
Using the methodology of other GWAS data resources , we deem results with a p-value less than 10e-5 as showing an association and so these are included in our nanopublications. An example GWAS nanopublication and its associated connections with key external resources [68–70] are shown in Figure 5.
The GWAS Central phenotype annotations can be queried and viewed from the web interface at: http://www.gwascentral.org/phenotypes
The GWAS Central SPARQL end-point can be accessed at: http://fuseki.gwascentral.org
The human-mouse comparative phenotype pipeline described in this paper, named “get human and mouse phenotypes for a gene”, is available from myExperiment at: http://www.myexperiment.org/workflows/2131.html
Genome-wide association study/studies
Human Phenotype Ontology
International Classification of Diseases
Medical Subject Headings
Mouse Genome Database
Mammalian Phenotype Ontology
Open Biological and Biomedical Ontologies
Online Mendelian Inheritance in Man
Resource Description Framework
Systematized Nomenclature of Medicine - Clinical Terms
Unified Medical Language System.
We thank the curation team of Robert Hastings, Sirisha Gollapudi and Raheleh Rahbari for providing the high-quality phenotype annotations within GWAS Central. This work was funded by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 200754, GEN2PHEN.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.