An overview of our method is:
1. Decide upon the GOAL representation;
2. Download GO, HDO and MPO and convert to OWL;
3. Download annotations of mouse gene products from the MGI database [10] and convert to the GOAL representation;
4. Create defined classes for each concept in the GO, MPO and HDO that link the notion of gene product to each of these attributes;
5. Create the GOAL ontology by importing all the components into one master ontology;
6. Apply an automated reasoner to the GOAL ontology;
7. Offer the GOAL ontology for subsumption querying through the construction of simple subclass queries based on the pre-built defined classes.
The GOAL representation
We take the five ontologies that describe aspects of gene products as they exist; we make no alterations to their form except to convert them into OWL and to add a ‘convenient’ root class to HDO and MPO (these two ontologies do not have a single root class, so disease gene product was created for the HDO and phenotype gene product was created for the MPO). We do this using the version of the OBO to OWL converter made available in the OWL API [39] version 3.2 downloaded from the OWL API website (http://owlapi.sourceforge.net/).
As there is no explicit representation of gene product in these ontologies, we created our own ontology to link gene products to the various aspects represented by the five ontologies. We chose the class Gene product as the top-level of our ontology as we can potentially describe both RNA and protein gene products.
Based on these considerations, the representation in OWL is fairly straightforward: for each kind of gene product, we generate an OWL class that has these gene products as instances, and we use the identifier of the kind of gene product as the class’s identifier and the gene product’s name as the class’s label. We then assert this class as a subclass of Gene product.
We use the following properties to create restrictions on Gene product classes with classes from each of the following ontologies:
Property: is_capable_of_process
Property range: GO:’biological process’
Definition: A relation between a material entity (such as a gene product) and a process. This property is asserted as a sub property of the OBO Relation Ontology capable_of in our GOAL ontology.
Property: is_capable_of_function
Property range: GO:’molecular function’
Definition: A relation between a material entity (such as a gene product) and a function. This property is asserted as a sub property of the OBO Relation Ontology capable_of in our GOAL ontology.
Property: is_located_in
Property range: GO:’cellular component’
Definition: See OBO_REL:located_in http://obofoundry.org/ro/#OBO_REL:located_in.
Property: is_associated_with_phenotype
Property range: MPO:’phenotype’
Definition: A relationship that associates members of the gene product class to at least one instance of a phenotype.
Property: is_associated_with_disease
Property range: HDO:’disease’
Definition: A relationship that associates members of the gene product class to at least one instance of a disease.
Using these properties, we generate the following Gene product class:
Class: ’Gene product’
SubClassOf: is_capable_of_function some GO: ’molecular function’,
is_located_in some GO: ’cellular component’,
is_capable_of_process some GO: ’biological process’,
is_associated_with_phenotype some MPO: ’phenotype’,
is_associated_with_disease some HDO: ’disease’
All restrictions upon the Gene product class are made with existential quantification; we ‘know’ that these relationships exist, but we do not ‘know’ that these are all the relationships that exist to these various aspects, so universal quantification cannot be used legitimately.
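As a minimal sketch of this representation (not the Java program used in this work), the following Python function renders one gene product class as a Manchester-syntax frame. The property names are those listed above; the function name, the aspect keys and the identifier used in the usage note are hypothetical and serve only as illustration.

```python
# Hypothetical helper: renders one gene-product class as a Manchester-syntax
# frame. Property names come from the text; the aspect keys are made up.
PROPERTY_FOR_ASPECT = {
    "process": "is_capable_of_process",
    "function": "is_capable_of_function",
    "component": "is_located_in",
    "phenotype": "is_associated_with_phenotype",
    "disease": "is_associated_with_disease",
}


def gene_product_frame(identifier, label, annotations):
    """Return a Manchester-syntax class frame for one kind of gene product.

    annotations is a list of (aspect, class_name) pairs. Each pair becomes an
    existential ("some") restriction; no universal restriction is emitted.
    """
    lines = [
        f"Class: {identifier}",
        f'    Annotations: rdfs:label "{label}"',
        "    SubClassOf: 'Gene product'",
    ]
    for aspect, class_name in annotations:
        prop = PROPERTY_FOR_ASPECT[aspect]
        lines[-1] += ","  # the previous entry is no longer the last one
        lines.append(f"        {prop} some '{class_name}'")
    return "\n".join(lines)
```

Calling `gene_product_frame("EX:0000001", "example gene product", [("component", "mitochondrion")])` would give a frame whose only restriction is `is_located_in some 'mitochondrion'`; the identifier here is a placeholder, not a real MGI accession.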
Gathering ontologies and gene product annotations
The GO annotations for 25,111 mouse genes were downloaded from the MGI website (http://informatics.jax.org/) in October 2011. We filtered these genes to exclude the RIKEN cDNA genes, and also discarded annotations to root GO terms from each of the biological process, molecular function and cellular component branches.
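The two filters can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the record layout (gene identifier, gene name, GO identifier) is hypothetical, and we assume that RIKEN cDNA genes can be recognised by the phrase "RIKEN cDNA" in the gene name, which the text does not specify.

```python
# Root terms of the three GO branches.
GO_ROOT_TERMS = {
    "GO:0008150",  # biological_process
    "GO:0003674",  # molecular_function
    "GO:0005575",  # cellular_component
}


def filter_annotations(records):
    """Keep (gene_id, gene_name, go_id) records that pass both filters.

    Assumption: RIKEN cDNA genes are identified by the phrase "RIKEN cDNA"
    in the gene name; the source does not say how they were identified.
    """
    kept = []
    for gene_id, gene_name, go_id in records:
        if "RIKEN cDNA" in gene_name:
            continue  # drop RIKEN cDNA genes
        if go_id in GO_ROOT_TERMS:
            continue  # drop annotations to a branch root
        kept.append((gene_id, gene_name, go_id))
    return kept
```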
For MPO annotations, we utilized the MGI_Geno_Disease.rpt file available from the MGI ftp site (ftp://ftp.informatics.jax.org/pub/reports/index.html). The file includes identifiers for loss-of-function mutant mouse models together with the identifier of the gene that has been targeted. We extracted the gene identifier and associated it with the observed phenotypes using the is_associated_with_phenotype relation.
From the same file (MGI_Geno_Disease.rpt), we extracted the OMIM annotations of mouse models of human disease. These annotations were manually created by curators after review of the scientific literature. The HDO provides mappings to OMIM diseases, i.e., it contains pairs of HDO classes and their equivalent OMIM diseases. We used these mappings to generate HDO-based annotations of mouse models, and associated these with the diseases in HDO using the is_associated_with_disease relation.
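The translation from OMIM-based to HDO-based annotations can be sketched as a two-step lookup. The data structures and function names below are hypothetical, and the sketch assumes the HDO mappings have already been read into a dictionary from HDO class to its OMIM cross-references; parsing of the report file is not shown.

```python
def invert_xrefs(hdo_xrefs):
    """Turn {hdo_id: [omim_id, ...]} into {omim_id: {hdo_id, ...}}."""
    omim_to_hdo = {}
    for hdo_id, omim_ids in hdo_xrefs.items():
        for omim_id in omim_ids:
            omim_to_hdo.setdefault(omim_id, set()).add(hdo_id)
    return omim_to_hdo


def hdo_annotations(gene_to_omim, omim_to_hdo):
    """Yield (gene_id, 'is_associated_with_disease', hdo_id) triples.

    OMIM entries without an HDO mapping are skipped rather than guessed.
    """
    for gene_id, omim_ids in sorted(gene_to_omim.items()):
        for omim_id in omim_ids:
            for hdo_id in sorted(omim_to_hdo.get(omim_id, ())):
                yield (gene_id, "is_associated_with_disease", hdo_id)
```

One consequence of this design is that a mouse model annotated with an OMIM disease that has no HDO equivalent contributes no disease annotation.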
Generating the OWL axioms
Instead of generating the axioms by hand, a Java program was written using the OWL API [39] to specify and instantiate the pattern for generating the class descriptions described in the introduction. For each class in the GO, MPO and HDO a new defined class was created that represented a gene product. The pattern we use is:
Class: ’?x gene product’
EquivalentTo: ’gene product’
that ?property some ?x
Where ?x is the class within GO, MPO or HDO, and ?property is substituted with the appropriate property described above. For example, for the mitochondrion class in GO we create a new class called mitochondrion gene product as follows:
Class: ’mitochondrion gene product’
EquivalentTo: ’gene product’
that is_located_in some ’mitochondrion’
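The pattern instantiation can be sketched in a few lines. This is not the OWL API program described above but a plain-text illustration of the same substitution; the branch keys and function names are hypothetical.

```python
# Property used for each source branch, as listed in the text.
PROPERTY_FOR_BRANCH = {
    "GO biological process": "is_capable_of_process",
    "GO molecular function": "is_capable_of_function",
    "GO cellular component": "is_located_in",
    "MPO": "is_associated_with_phenotype",
    "HDO": "is_associated_with_disease",
}


def defined_class(term_label, branch):
    """Instantiate the '?x gene product' pattern for one source class."""
    prop = PROPERTY_FOR_BRANCH[branch]
    return "\n".join([
        f"Class: '{term_label} gene product'",
        "    EquivalentTo: 'gene product'",
        f"        that {prop} some '{term_label}'",
    ])


def defined_classes(terms):
    """Apply the pattern to every (label, branch) pair, one frame per class."""
    return [defined_class(label, branch) for label, branch in terms]
```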
Our strategy in creating such defined classes for each of the classes in GO, MPO and HDO was two-fold: it creates hierarchies of gene products over the actual classes of mouse gene products (as shown in Figure 1), and it affords a reasonably straightforward mechanism for creating more complex queries for the gene products. Our aim was to query by combining features from GO, MPO and HDO in any arbitrary combination. This would be complex if we asked users to write these subsumption queries according to the pattern for ’?x gene product’ described above. We can, however, make such queries easier by allowing simple intersecting classes to be made through the array of defined classes we generate. For instance, to ask for gene products that have a receptor activity, are participants in signal transduction and appear in the synaptic membrane, we formulate the following query and ask the reasoner for its subclasses:
’signal transduction gene product’
and ’receptor activity gene product’
and ’synaptic membrane gene product’
This query is both a short form for, and logically equivalent to, asking for subclasses of:
is_capable_of_process some GO: ’signal transduction’
and is_capable_of_function some GO: ’receptor activity’
and is_located_in some GO: ’synaptic membrane’
This form of querying makes it easier to develop a user interface for querying: classes are simply chosen and added to a list of classes over which to generate a defined class in the same pattern. Creating one defined class, a ‘singleton class’, for each class in GO, HDO and MPO gives sufficient building blocks for any such query, whereas creating all possible combinations from the supporting ontologies is not possible, and even a limited number would make for a cluttered and difficult user interface. We still leave open the possibility of more complex queries that use more of OWL’s expressivity, for example disjunction. The disadvantages are that such queries require a more complex syntax, and therefore more complex user interface support, and that they raise the complexity of automated reasoning.
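The relationship between the short form and the expanded form of a query can be illustrated with a toy model. The sketch below is not an OWL reasoner: it handles only conjunctions of existential restrictions over an asserted hierarchy, the hierarchy edges are illustrative rather than copied from GO, and all function names are hypothetical.

```python
# Illustrative hierarchy (child -> parents); not copied from GO.
PARENTS = {
    "postsynaptic membrane": ["synaptic membrane"],
    "synaptic membrane": ["membrane"],
}

# Defined classes follow the '?x gene product' pattern from the text.
DEFINED = {
    "signal transduction gene product": ("is_capable_of_process", "signal transduction"),
    "receptor activity gene product": ("is_capable_of_function", "receptor activity"),
    "synaptic membrane gene product": ("is_located_in", "synaptic membrane"),
}


def ancestors(term):
    """The term itself plus everything reachable through PARENTS."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS.get(current, ()))
    return found


def expand(named_classes):
    """Rewrite the short form into (property, filler) conjuncts."""
    return [DEFINED[name] for name in named_classes]


def subclasses(conjuncts, gene_products):
    """Gene products that have, for every conjunct, a restriction on the same
    property whose filler is the queried class or one of its descendants."""
    hits = []
    for name, restrictions in gene_products.items():
        if all(
            any(p == prop and filler in ancestors(f) for p, f in restrictions)
            for prop, filler in conjuncts
        ):
            hits.append(name)
    return sorted(hits)
```

In this toy model, a gene product located in the postsynaptic membrane is returned for a query on synaptic membrane because the filler of its restriction is a descendant of the queried class, which mirrors the subsumption behaviour the defined classes rely on.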
It is possible in this query mechanism to make queries that are biologically ‘nonsense’. The GO annotations, for instance, do not record explicitly the cellular locations in which different annotations for functions and activities take place. For example, gene products that participate in microtubule based locomotion do so only in the microtubule cellular component. Such genes may participate in other processes outside of that location, but such information is lost in the GO annotation. Therefore, it is possible to issue a query that combines function, biological process and location and that recalls gene products that do not hold that combination of attributes at the same time. This has long been recognised [6, 40], with fixes proposed such as simple statistical co-occurrence [6, 40] and adding information from text-mining to incorporate this information. This is an important issue, but these approaches are only really patches for the problem. The GO Consortium, however, is releasing extensions to the GO that link between the various GO aspects [35, 36, 41]. For example, the occurs_in property is used to relate processes to the cellular component location at which they occur. We have used these GO extensions within GOAL, and with increasing coverage of these relations the accuracy of the enabled queries will increase.
Classifying the GOAL ontology
All portions of the GOAL ontology have been automatically generated. In order to browse and query the ontology we needed to classify the ontology. We kept the ontology in the OWL 2 EL profile [42], as automated reasoning for the OWL 2 EL profile is tractable [43, 44] and therefore enables fast querying. We explored which classifier was most rapid by using the following set of automated reasoners:
We classified the whole GOAL ontology three times with each reasoner and calculated the mean classification time in milliseconds. We utilised the Java ThreadMXBean library to compute thread CPU time for each classification. As the reasoners behave differently with respect to the way they load and pre-process the ontologies, we measured the time from when the reasoner is instantiated by the OWL API to the point at which the reasoner returned the answer to a query for all subclasses of owl:Thing.
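The measurement protocol can be sketched as follows. The sketch uses Python's `time.thread_time_ns` as an analogue of the thread CPU time reported by ThreadMXBean; it is not the harness used for the reported timings, and `task` is a hypothetical stand-in for "instantiate the reasoner and query for all subclasses of owl:Thing".

```python
import time
from statistics import mean


def mean_thread_cpu_ms(task, runs=3):
    """Run task `runs` times and return the mean thread CPU time in ms.

    Only CPU time consumed by the calling thread is counted, so time spent
    sleeping or waiting on I/O is excluded.
    """
    samples = []
    for _ in range(runs):
        start = time.thread_time_ns()
        task()
        samples.append((time.thread_time_ns() - start) / 1e6)
    return mean(samples)
```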
The GOAL user interface
We created a user interface using the Google Web Toolkit (GWT) [49]. The GOAL interface has the following design principles:
- Allow elements of a simple intersecting query of named classes to be picked via browsing;
- Allow more complex queries to be issued using Manchester OWL Syntax;
- Show the subclasses that are also gene products for the generated query;
- Show each gene product in the results table along with its OWL description expressed in Manchester OWL Syntax.
To query interactively, we do not need to classify for each query. The GOAL user interface is built on top of the OWL API, so we can classify once at deploy time; each query is then constructed behind the scenes and sent to the chosen reasoner through the OWL API. The results returned are then tabulated and displayed.
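The classify-once, query-many arrangement can be sketched as a small service object. This is an assumption-laden illustration, not the GWT and OWL API implementation: the class name is hypothetical, the "classification" is a plain transitive closure standing in for a reasoner, and the input is assumed to be an already inferred child-to-parents map.

```python
class QueryService:
    """Classifies once on construction, then answers queries from the result."""

    def __init__(self, parents):
        self.classify_calls = 0
        self._descendants = self._classify(parents)

    def _classify(self, parents):
        # Stand-in for the reasoner: invert child -> parents into
        # ancestor -> all (transitive) descendants.
        self.classify_calls += 1
        descendants = {}
        for cls in parents:
            stack, seen = list(parents[cls]), set()
            while stack:
                parent = stack.pop()
                if parent in seen:
                    continue
                seen.add(parent)
                descendants.setdefault(parent, set()).add(cls)
                stack.extend(parents.get(parent, ()))
        return descendants

    def subclasses(self, *named_classes):
        """Classes that fall under every named class (an intersection query)."""
        sets = [self._descendants.get(name, set()) for name in named_classes]
        return sorted(set.intersection(*sets)) if sets else []
```

Each call to `subclasses` is then a set lookup and intersection over the precomputed result, so the cost of classification is paid once regardless of how many queries follow.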