Ontological resources
Mammalian Phenotype Ontology (MP): We downloaded the MP version created on 8th April 2011 from [13]; it comprised 8,507 concepts. The formal definitions for MP were downloaded separately from the same source. The file provided formal definitions for 5,389 MP concepts.
Human Phenotype Ontology (HP): The HP version used for this study was downloaded from [14]. It was created on 7th April 2011 and contained 10,104 concepts. The formal definitions were downloaded separately from the same source and covered 4,860 concepts.
Databases containing gene-disease associations
We used two community-wide established resources containing manually verified gene and disease related data: the Mouse Genome Informatics (MGI) [15] and the Online Mendelian Inheritance in Man (OMIM) [16] database.
The MGI database integrates genetic, genomic and phenotypic information about the laboratory mouse. For this study, three of the report files from the MGI database were downloaded [17]:
♦ MGI_GenoDisease.rpt, accessed on 9th March 2011,
♦ MGI_GenePheno.rpt, accessed on 9th March 2011, and
♦ HMD_Human5.rpt, also accessed on 9th March 2011.
MGI_GenoDisease.rpt contained associations between diseases and specific genotypes (one genotype corresponds to one mouse model) that can be linked to affected genes. MGI_GenePheno.rpt contained information about genotypes and their observed phenotypes, which are described using MP. HMD_Human5.rpt contained information about human-mouse orthologous genes.
The OMIM database collects information about human inheritable diseases, including genotype and phenotype information, and known gene-disease associations. It contains about 20,000 entries, of which around 13,000 describe genes and about 7,000 describe diseases. MorbidMap (downloaded on 1st March 2011) contains up-to-date information about known links between human diseases and genes. The version downloaded for this study contained 2,717 diseases linked to 2,266 genes, with 3,463 distinct gene-disease associations. Phenotypic information (HP annotations) for OMIM diseases is available from the HP web page [14]. The downloaded file comprised annotations for approximately 4,000 OMIM entries.
Mappings between species-specific phenotype ontologies
Mappings between ontologies
Let O1 and O2 be two ontologies with sets of named concepts C(O1) and C(O2). A mapping between O1 and O2 is a set of axioms Ax = {ϕ1(x1, y1), ..., ϕn(xn, yn)} such that xi ∈ C(O1) and yj ∈ C(O2).
Here, we focus on mappings where the axioms relating concepts from two ontologies take the form of sub-class and equivalent-class axioms between atomic concepts. In particular, given the two concepts A ∈ O1 and B ∈ O2, a mapping involving both A and B will be of the form
♦ A SubClassOf: B, or
♦ B SubClassOf: A, or
♦ A EquivalentTo: B.
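The definition above can be sketched as a small data structure. This is only an illustration: the class names `Relation` and `MappingAxiom`, the helper `is_valid_mapping`, and the concept identifiers are hypothetical and are not part of any tool used in this study.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    SUBCLASS_OF = "SubClassOf"
    EQUIVALENT_TO = "EquivalentTo"


@dataclass(frozen=True)
class MappingAxiom:
    source: str          # identifier of an atomic concept
    relation: Relation   # sub-class or equivalent-class axiom
    target: str          # identifier of a concept from the other ontology


def is_valid_mapping(axioms, concepts_o1, concepts_o2):
    """Return True if every axiom relates one concept from each ontology."""
    for axiom in axioms:
        forward = axiom.source in concepts_o1 and axiom.target in concepts_o2
        backward = axiom.source in concepts_o2 and axiom.target in concepts_o1
        if not (forward or backward):
            return False
    return True


# Hypothetical concept sets C(O1) and C(O2).
concepts_o1 = {"O1:A", "O1:C"}
concepts_o2 = {"O2:B"}

mapping = {
    MappingAxiom("O1:A", Relation.SUBCLASS_OF, "O2:B"),    # A SubClassOf: B
    MappingAxiom("O2:B", Relation.SUBCLASS_OF, "O1:C"),    # B SubClassOf: C
    MappingAxiom("O1:C", Relation.EQUIVALENT_TO, "O2:B"),  # C EquivalentTo: B
}
```

An axiom that relates two concepts of the same ontology is rejected by `is_valid_mapping`, since it is not part of a mapping in the sense defined above.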
Generating mappings through lexical matching
In this study, we used the Lexical OWL Ontology Matcher (LOOM) [7] to generate the lexical matching of concepts between ontologies. LOOM was applied to HP and MP concept names and synonyms. Based on names and synonyms, LOOM extracted 495 HP-MP concept pairs in the form
HP:0002249 MP:0003292.
We imported both ontologies into a single ontology, inserted the pairs extracted by LOOM as equivalence statements and reasoned over the combined ontology. We generated the mapping by extracting, for each concept, the equivalent and super concepts belonging to the other ontology. In most cases, one concept from one ontology was mapped to multiple concepts from the other ontology.
An example of the resulting mapping looks like
HP:0007062 MP:0000001 MP:0002106 MP:0004142 MP:0004143 MP:0005369.
Because the two ontologies differ in their structure, the mappings are not symmetrical. For example, HP:0008590 'Progressive childhood hearing loss' maps to MP:0006325 'Impaired hearing' but MP:0006325 maps to HP:0000365 'Hearing impairment' (only the most specific concepts are given in this example).
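The extraction step and the resulting asymmetry can be illustrated with a minimal sketch. It assumes that each ontology is reduced to plain is-a edges and that reasoning is approximated by a transitive closure over those edges plus the seed equivalences; the study itself used an OWL reasoner, which does considerably more. All concept identifiers and function names below are hypothetical.

```python
def build_graph(hierarchies, equivalences):
    """Merge is-a edges; treat each equivalence as two subclass edges."""
    parents = {}
    for hierarchy in hierarchies:
        for child, supers in hierarchy.items():
            parents.setdefault(child, set()).update(supers)
    for a, b in equivalences:
        parents.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)
    return parents


def superclasses(concept, parents):
    """All equivalent and super concepts reachable from `concept`."""
    seen, stack = set(), [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    seen.discard(concept)
    return seen


def map_concept(concept, parents, target_prefix):
    """Keep only the reachable concepts that belong to the other ontology."""
    reachable = superclasses(concept, parents)
    return sorted(c for c in reachable if c.startswith(target_prefix))


# Toy hierarchies (child -> direct super concepts); identifiers are made up.
hp_toy = {"HP:child_loss": {"HP:hearing"}, "HP:hearing": {"HP:root"}}
mp_toy = {"MP:impaired": {"MP:root"}}
seed_pairs = [("HP:hearing", "MP:impaired")]  # stands in for a lexical match

parents = build_graph([hp_toy, mp_toy], seed_pairs)
hp_to_mp = map_concept("HP:child_loss", parents, "MP:")
mp_to_hp = map_concept("MP:impaired", parents, "HP:")
```

In this toy case the more specific HP concept maps to `MP:impaired`, but `MP:impaired` maps back only to the more general `HP:hearing` and its super concepts, which mirrors the non-symmetry described above.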
The resulting mappings together with the ontology file can be downloaded from the project web page http://code.google.com/p/ontmapcomp/.
Mapping through automated reasoning
PhenomeBLAST integrates the formal definitions that were created for classes from the HP and MP [18], together with several other ontologies, such as the Gene Ontology and UBERON. The ontologies are all converted into OWL EL to enable efficient automated reasoning [19]. PhenomeBLAST then uses the CB reasoner to classify the combined ontology [20]. To generate the mappings from MP to HP, PhenomeBLAST identifies all equivalent classes and superclasses of an MP class in HP, and vice versa for the direction from HP to MP. The mappings generated by the PhenomeBLAST software are available at http://phenomeblast.googlecode.com; for this study, we downloaded the mappings provided in June 2011.
Direct comparison of mappings
Both the lexical matching method and the formal definitions method generate non-symmetrical mappings for each of the ontologies, which results in four mappings in total (compare the bottom two rows in Table 1). Due to the non-symmetry, the generated mappings had to be investigated independently. For the concepts represented with either method, we compared the lists of mapped concepts with each other and determined how well the lists overlapped. The direct comparison was executed for both directions independently, HP to MP and MP to HP.
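One possible way to quantify such an overlap is sketched below. The text does not name the overlap measure, so the Jaccard index used here is an assumption, as is the restriction to concepts that both methods map. The function names and toy identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections of concept identifiers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def compare_mappings(lexical, definitional):
    """Overlap per source concept, for concepts mapped by both methods.

    Each argument is a dict: source concept -> list of mapped concepts
    from the other ontology (one direction only, e.g. HP to MP).
    """
    shared = lexical.keys() & definitional.keys()
    return {c: jaccard(lexical[c], definitional[c]) for c in sorted(shared)}


# Toy mappings for one direction; all identifiers are made up.
lexical_hp_to_mp = {
    "HP:a": ["MP:x", "MP:y"],
    "HP:b": ["MP:z"],
}
definitional_hp_to_mp = {
    "HP:a": ["MP:y", "MP:w"],
    "HP:c": ["MP:z"],
}

overlap = compare_mappings(lexical_hp_to_mp, definitional_hp_to_mp)
```

Because the mappings are not symmetrical, the same comparison would be run a second time on the MP to HP mappings.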
Impact of mapping methods on applications
To assess and quantify the quality of the mappings, we additionally evaluated the performance of each method on the biological use case of disease candidate gene prioritization. For that purpose, we used the phenotype descriptions of mouse models contained in MGI_GenePheno.rpt and the HP annotations of OMIM diseases. Because the mappings of either method are not symmetrical, we investigated two different scenarios: in the first, we "translated" the MP descriptions of the mouse models to HP using either method's mapping, whilst in the second, we "translated" the HP descriptions of the OMIM diseases to MP. We then calculated the phenotype similarity for all possible combinations of mouse models and diseases. The phenotype similarity is the cosine similarity between the vector representations of a disease and a mouse model. The cosine similarity is described as:
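$$\mathrm{sim}(\mathbf{d}, \mathbf{m}) = \frac{\mathbf{d} \cdot \mathbf{m}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{m} \rVert} \qquad (1)$$
where $\mathbf{d}$ and $\mathbf{m}$ denote the feature vectors of a disease and a mouse model, respectively.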
In the first scenario, both feature vectors are built using HP concepts and in the second, both feature vectors contain MP concepts.
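As an illustration of the similarity computation, the following sketch builds feature vectors over a shared concept vocabulary and computes their cosine similarity. Binary (present/absent) features are an assumption made here for simplicity; the text does not state how the vector entries were weighted. The concept labels and function names are hypothetical.

```python
import math


def to_vector(concepts, vocabulary):
    """Binary feature vector: 1.0 if the concept annotates the entity."""
    return [1.0 if c in concepts else 0.0 for c in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an entity without phenotypes is similar to nothing
    return dot / (norm_u * norm_v)


# Toy phenotype descriptions, both already expressed in the same ontology.
disease = {"P:1", "P:2"}
model = {"P:2", "P:3"}

vocabulary = sorted(disease | model)
score = cosine_similarity(to_vector(disease, vocabulary),
                          to_vector(model, vocabulary))
```

Both vectors must be expressed in the same ontology before they are compared, which is why one side is "translated" with a mapping first.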
The phenotype similarity score for each disease-model pair was used to rank the mouse models for each disease. Then, we compared the obtained gene-disease pairs (each mouse model is associated with one gene) to OMIM and recorded the ranks of the known gene-disease associations to evaluate the performance of each method. In the absence of true negative examples, we assume that known gene-disease associations constitute positive examples while unknown associations constitute negative examples. The true and false positive rates are calculated across all diseases and over all mouse models possessing a phenotype representation, relative to the gene-disease associations contained in MorbidMap. Both true and false positive rates are then used to draw the Receiver Operating Characteristic (ROC) curves (compare Figure 2) for both scenarios of the biological use case.
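The ranking and ROC computation can be sketched as follows. This is a simplified reading of the procedure, not the code used in the study: it assumes one similarity score per disease-gene pair, uses the rank within each disease as the decision threshold, and breaks ties alphabetically. All names and toy values are hypothetical.

```python
from collections import defaultdict


def rank_genes(scores):
    """scores: {(disease, gene): similarity} -> {disease: genes, best first}."""
    per_disease = defaultdict(list)
    for (disease, gene), score in scores.items():
        per_disease[disease].append((score, gene))
    return {
        disease: [g for _, g in sorted(pairs, key=lambda p: (-p[0], p[1]))]
        for disease, pairs in per_disease.items()
    }


def roc_points(rankings, known):
    """(false positive rate, true positive rate) for each rank cutoff."""
    labelled = [
        (rank, (disease, gene) in known)
        for disease, genes in rankings.items()
        for rank, gene in enumerate(genes, start=1)
    ]
    positives = sum(1 for _, is_known in labelled if is_known)
    negatives = len(labelled) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one known and one unknown pair")
    max_rank = max(rank for rank, _ in labelled)
    points = [(0.0, 0.0)]
    for cutoff in range(1, max_rank + 1):
        tp = sum(1 for r, is_known in labelled if is_known and r <= cutoff)
        fp = sum(1 for r, is_known in labelled if not is_known and r <= cutoff)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy similarity scores and known associations (stand-in for MorbidMap).
scores = {
    ("D1", "G1"): 0.9, ("D1", "G2"): 0.4, ("D1", "G3"): 0.1,
    ("D2", "G1"): 0.2, ("D2", "G2"): 0.8, ("D2", "G3"): 0.5,
}
known = {("D1", "G1"), ("D2", "G3")}

rankings = rank_genes(scores)
points = roc_points(rankings, known)
auc = area_under_curve(points)
```

Treating every unknown pair as a negative example follows the assumption stated above; associations that exist but are not yet recorded therefore count against a method.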