Equivalence
Names or terms referring to either a class of genes, DNA, proteins, RNAs and their splice variants, gene products, arbitrary transcripts or similar are considered to be equivalent within the context of the GENIA relation annotations. These classes are called genes/gene products (GGPs). For example, CD19, CD19 protein and CD19 gene may be considered to be equivalent and represent a single GGP.
We define a class $G_C$ based on a class C, which is assumed to be a subclass of DNA, and entities derived from C through chains of transcription and translation relations between individuals. The classes Protein, DNA and RNA are those used in the GENIA term ontology.
Such a formalization has the benefit of connecting the different kinds of GGPs through formal relations that can be exploited by an automated reasoner.
For example, the name “CD19 protein” refers to a class of proteins, and instances of this class stand in a translated-from relation to instances of a class of RNA which may be referred to as “CD19 RNA”. Instances of this class of RNA stand in a transcribed-from relation to instances of a class of DNA which may be referred to as “CD19 gene”. Thus, according to our definition, all three classes are subclasses of the GGP class $G_{CD19}$.
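The construction of a GGP class from a DNA class can be illustrated on a finite set of individuals. The following is a minimal sketch, not the formal definition: the `Individual` record, the `derived_from` field (standing in for both transcribed-from and translated-from) and all instance names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Individual:
    name: str
    kind: str                            # "DNA", "RNA" or "Protein"
    derived_from: Optional[str] = None   # source of transcription/translation (assumed field)


def ggp_members(seed_dna, individuals):
    """Collect the seed DNA instances plus everything reachable from them
    through chains of derived_from links."""
    members = set(seed_dna)
    changed = True
    while changed:
        changed = False
        for ind in individuals:
            if ind.name not in members and ind.derived_from in members:
                members.add(ind.name)
                changed = True
    return members


# Hypothetical toy individuals.
individuals = [
    Individual("cd19_gene_1", "DNA"),
    Individual("cd19_rna_1", "RNA", derived_from="cd19_gene_1"),
    Individual("cd19_protein_1", "Protein", derived_from="cd19_rna_1"),
    Individual("cd21_gene_1", "DNA"),
    Individual("cd21_rna_1", "RNA", derived_from="cd21_gene_1"),
]

g_cd19 = ggp_members({"cd19_gene_1"}, individuals)
```

In this toy setting, `g_cd19` contains the gene, its transcript and the translated protein, while the CD21 individuals are excluded.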
Subclass
The class-subclass relation is used to annotate the relation between terms or names in the GENIA corpus where one term refers to a more general class than the other term. For example, this relation holds between the names “CD19 human” (denoting the class CD19 human) and “CD19” (denoting a class that is indistinguishable from the class CD19 (GGP)). We base the definition of the class-subclass relation upon the ontological is-a relation [22]: the classes C and D stand in the is-a relation if and only if every instance of C is also an instance of D.
For example, the referent of the name “human CD19 gene” (the class CD19 human gene) stands in the is-a relation to the referent of the name “CD19” (the GGP class CD19 (GGP)), because all instances of CD19 human gene are also instances of CD19 (GGP).
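On a finite sample of instances, the is-a condition reduces to set inclusion. The sketch below assumes that class extensions are given as Python sets of hypothetical instance names; the ontological relation itself quantifies over all instances, so this is only an extensional approximation.

```python
def is_a(extension_c, extension_d):
    """C is-a D iff every instance of C is also an instance of D."""
    return set(extension_c) <= set(extension_d)


# Hypothetical instance names.
cd19_human_gene = {"human_cd19_gene_1", "human_cd19_gene_2"}
cd19_ggp = cd19_human_gene | {"mouse_cd19_gene_1", "human_cd19_protein_1"}

holds = is_a(cd19_human_gene, cd19_ggp)     # CD19 human gene is-a CD19 (GGP)
reverse = is_a(cd19_ggp, cd19_human_gene)   # the converse does not hold
```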
Mereological relations
The largest group of relations in the relationship annotations of the GENIA corpus refers to mereological relations, i.e., relations between parts and their wholes. Three kinds of parthood relations are distinguished within GENIA:
♦ relations between a whole and its components, for example between the classes CD19 promoter and CD19,
♦ relations between a collection and its members, as between Hox gene family and HOXA1,
♦ the relation between an entity and the location at which this entity exists, such as CD19 which is located at CD19 locus.
Substantial work has already been undertaken with regard to mereological relations and their representation in OWL and biomedical ontologies [20, 23, 24]. In particular, the relation CC-part-of, as a relation between classes (we generally prefix relations between two classes with CC-, and relations that hold between two individuals with II-), must be defined in terms of another relation II-part-of, which is a relation between individuals [20, 25]. For example, CC-part-of can be defined as C ⊑ ∃II-part-of.D and CC-has-part as C ⊑ ∃II-has-part.D. Although these definitions are valid for many of the parthood relations asserted between classes in biological ontologies, they are inadequate schemata for parthood relations which have a GGP class as argument, because the GGP class is “too general”.
Because a GGP class has several GGP-equivalent subclasses, a CC-has-part or CC-part-of relation may be valid for one of these subclasses but not for the others. For example, assuming the definition of CC-has-part above, asserting a CC-has-part relation between the GGP class CD19 (GGP) and CD19 promoter would be incorrect, because the GGP class also includes the CD19 protein class, which has no promoter as part (in virtue of being a class of proteins). Similarly, although it would be correct to assert that CD19 promoter CC-part-of CD19, it would be incorrect to say that CD19 CC-part-of CD19/CD21/CD81/Leu-13 complex. If both statements held, we could infer that CD19 promoter is CC-part-of the CD19/CD21/CD81/Leu-13 complex, which is incorrect because protein complexes have no promoters as parts.
Consequently, we use the following alternative definition for the GGP-subclass-has-part relation (where the argument $G_C$ refers to a GGP class, and X to an arbitrary class):
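One reading of this disjunctive definition can be sketched as follows. It assumes CC-has-part as defined above (C ⊑ ∃II-has-part.D) and takes the three GGP-equivalent subclasses to be the intersections of $G_C$ with DNA, RNA and Protein. It is a reconstruction from the surrounding description, so the exact published form may differ (for instance, it may additionally require the chosen subclass to be non-empty).

```latex
% Requires amsmath. Reconstructed sketch, not a quotation of the published formula.
\[
\begin{aligned}
G_C \ \text{GGP-subclass-has-part}\ X \iff{}
  & (G_C \sqcap \mathit{DNA} \sqsubseteq \exists\, \text{II-has-part}.X) \\
  \lor{} & (G_C \sqcap \mathit{RNA} \sqsubseteq \exists\, \text{II-has-part}.X) \\
  \lor{} & (G_C \sqcap \mathit{Protein} \sqsubseteq \exists\, \text{II-has-part}.X)
\end{aligned}
\]
```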
In the OWL syntax, a disjunction of axioms is not permitted. Consequently, we have to reformulate the right side of the definition by using a single subclass axiom (where ⊥ refers to the OWL class owl:Nothing) and derive the equivalent definition:
Intuitively, this definition states that if the GGP class $G_C$ stands in the GGP-subclass-has-part relation to the class X, then either the DNA, RNA or Protein subclass of $G_C$ must stand in a CC-has-part relation to X. Using this pattern, we are further able to define the relation GGP-subclass-part-of by replacing II-has-part with II-part-of in definition 3.
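The difference between the plain CC-has-part schema and the GGP-subclass pattern can be checked on a small example. The sketch below is a finite, extensional approximation only: the dictionary layout, the instance names and the requirement that the chosen subclass be non-empty are assumptions made for illustration.

```python
KINDS = ("DNA", "RNA", "Protein")


def cc_has_part(instances, has_parts, x_instances):
    """CC-has-part on a finite sample: every instance has some part in X."""
    return all(has_parts.get(i, set()) & x_instances for i in instances)


def ggp_subclass_has_part(ggp, has_parts, x_instances):
    """Holds if some non-empty DNA/RNA/Protein subclass of the GGP class
    stands in CC-has-part to X (non-emptiness is an assumption here)."""
    return any(
        bool(ggp.get(kind)) and cc_has_part(ggp[kind], has_parts, x_instances)
        for kind in KINDS
    )


# Hypothetical toy data for CD19 (GGP) and CD19 promoter.
cd19 = {"DNA": {"gene_1"}, "RNA": {"rna_1"}, "Protein": {"protein_1"}}
has_parts = {"gene_1": {"promoter_1"}, "protein_1": {"domain_1"}}
promoters = {"promoter_1"}

all_instances = set().union(*cd19.values())
naive = cc_has_part(all_instances, has_parts, promoters)      # whole GGP class
pattern = ggp_subclass_has_part(cd19, has_parts, promoters)   # subclass pattern
```

Here `naive` is false, because the RNA and protein instances have no promoter as part, whereas `pattern` is true, because the DNA subclass satisfies the condition.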
II-part-of is a primitive relation and we assert axioms that hold for it. II-part-of is reflexive, transitive and antisymmetric. We define II-proper-part-of:
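In classical mereology, proper parthood is usually obtained from parthood by excluding identity. A sketch in that standard form, which we assume here rather than quote, is:

```latex
% Requires amsmath. Standard mereological form, assumed for illustration.
\[
\text{II-proper-part-of}(x, y) \iff \text{II-part-of}(x, y) \land x \neq y
\]
```

Given the antisymmetry of II-part-of, the condition $x \neq y$ is equivalent to requiring that II-part-of$(y, x)$ does not hold.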
It is the II-proper-part-of relation which will provide the basis for the mereological relations within GENIA, because identical (or co-extensional) classes are not annotated as standing in a parthood relation. Parthood relations that are not based upon location are further distinguished into two kinds in the GENIA relation annotation: a relation between components and the objects of which they are components, and membership in collections. We assume that the component-object relation (between individuals) II-oc-part-of is similar to the relation of determinate parthood [23] in that it is reflexive, transitive, antisymmetric and satisfies the strong supplementation principle [24]. Assuming these axioms for II-oc-part-of provides compatibility with the SO, which also assumes the axioms of extensional mereology for the entities classified by it [17, 26].
The member-component relation, on the other hand, is a relation between entities of different kinds and is neither reflexive nor antisymmetric [23, 27]. The II-member-of relation is a sub-relation of the II-proper-part-of relation and is non-reflexive, asymmetric and non-transitive [27]. II-member-of is not the same relation as the member-of relation in the SO; in the SO, member-of is transitive, while II-member-of is non-transitive. The relation GGP-subclass-member-of holds between a GGP class and a collection, such that for one of the subclasses of the GGP class, all instances are a member of some instance of the collection. Therefore, the same pattern as in definition 3 applies for the definition of GGP-subclass-member-of. For example, the Lck (GGP) class stands in the GGP-subclass-member-of relation to the protein family Src family, because there is a subclass of Lck (GGP), i.e., Lck protein, such that all instances of this subclass stand in an II-member-of relation to some instance of Src family. We do not provide a formal characterization of protein family here, but re-use the class from the GENIA term ontology and represent specific protein families (such as the Src family) as subclasses of GENIA’s Protein family class. A detailed formal characterization of Protein family within GENIA is subject to future work.
The third parthood relation used in the GENIA corpus annotations is GGP-subclass-region-of, which we define by using the primitive II-region-of relation. In the GENIA relation annotations, GGP-subclass-region-of is used to relate a GGP class to a genomic location. We introduce GGP-subclass-region-of to relate the GGP class to the class of loci. The region is a place where all instances of one subclass of the GGP class are located. As for the definition of GGP-subclass-has-part, GGP-subclass-part-of and GGP-subclass-member-of, we assume that there is a subclass of the GGP class for which all instances are located in some instance of the locus, and we use the same pattern as in formula 3. Next we define the interactions of II-region-of with II-part-of. We want to be able to infer that if the individual x is part of y, and y is located at z, then x is located at z. Furthermore, if the individual x is located at y and y is a part of z, then we infer that x is located at z. We state these conditions using the following axioms in OWL:
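In OWL 2, conditions of this kind are commonly expressed as property chain axioms. Their effect can be illustrated by computing the closure of a location relation under the two rules. The sketch below is an illustration on hypothetical individuals only: it uses a neutral `located_at` relation (x is located at y) so as not to commit to the argument order of II-region-of.

```python
def infer_locations(part_of, located_at):
    """Close located_at under the two rules:
         part_of(x, y) and located_at(y, z)  ->  located_at(x, z)
         located_at(x, y) and part_of(y, z)  ->  located_at(x, z)
    """
    inferred = set(located_at)
    changed = True
    while changed:
        changed = False
        new = set()
        for (x, y) in part_of:
            for (y2, z) in inferred:
                if y == y2:
                    new.add((x, z))
        for (x, y) in inferred:
            for (y2, z) in part_of:
                if y == y2:
                    new.add((x, z))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical individuals.
part_of = {
    ("cd19_promoter_1", "cd19_gene_1"),
    ("cd19_locus_1", "chromosome_region_1"),
}
located_at = {("cd19_gene_1", "cd19_locus_1")}

locations = infer_locations(part_of, located_at)
```

Starting from a single asserted location, the closure places the promoter at the locus (first rule) and both the gene and the promoter at the enclosing chromosome region (second rule).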
Objects and their variants
The second major group of GENIA corpus relations connects names of GGP classes to names of classes of their variants. Again, we formalize the relations that hold between the classes that are denoted by these names.
The GENIA annotations for GGP classes and their variants use six different relations to express the following relationships:
♦ GGPs to modified proteins, e.g., TR alpha 1 (GGP) to 35S-TR alpha 1 (Protein),
♦ GGPs to protein isoforms, e.g., ACTA1 (Protein) to G-Actin (GGP),
♦ GGPs to mutants, e.g., TNFRI (GGP) to dominant-negative mutant TNFRI (Protein),
♦ GGPs to recombinants, e.g., Oct-2 (GGP) to Oct-2 expression vector (DNA),
♦ GGPs to precursors, e.g., IL-16 (GGP) to pro-IL-16 (Protein),
♦ GGPs to experimental material, in particular to antisense elements, e.g., GATA-3 (GGP) to antisense GATA-3 RNA (RNA).
We call the basic relation between a GGP and its variant GGP-has-variant. There is a general schema involved in the sub-relations of GGP-has-variant that we exploit in its definition: whenever GGP-has-variant($G_C$, D), then every instance of D is a variation of some instance of $G_C$. Although it is possible to identify a more specific subclass of $G_C$ in some cases, this is not true for all sub-relations of GGP-has-variant. We define the relation $G_C$ GGP-has-variant D by using the relation II-has-variant, which is a relation between individuals:
Again, we provide basic axioms for the II-has-variant relation. Our first observation is that variance is reflexive, i.e., everything (every molecule) is a variant of itself. Furthermore, variance is symmetric, i.e., if x is a variant of y, then y is a variant of x. Whether II-has-variant is transitive is more difficult to ascertain. While it seems to be the case that, if x is a variant of y and y a variant of z, then x is a variant of z, this principle may fail if the distance between x and z increases, i.e., more intermediate variants are introduced. Consequently, we do not assume that II-has-variant is transitive.
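These properties can be checked mechanically on a finite relation. The following sketch uses a hypothetical chain of three molecules, with the variant relation represented as a set of ordered pairs; it shows a relation that is reflexive and symmetric without being transitive.

```python
def is_reflexive(rel, domain):
    return all((x, x) in rel for x in domain)


def is_symmetric(rel):
    return all((y, x) in rel for (x, y) in rel)


def is_transitive(rel):
    return all(
        (x, w) in rel
        for (x, y) in rel
        for (z, w) in rel
        if y == z
    )


def symmetric_reflexive_closure(pairs, domain):
    """Add the converse of every pair and the identity pair for every element."""
    closed = set(pairs) | {(y, x) for (x, y) in pairs}
    return closed | {(x, x) for x in domain}


# Hypothetical molecules: variant_b is only a variant of variant_a.
molecules = {"wild_type", "variant_a", "variant_b"}
has_variant = symmetric_reflexive_closure(
    {("wild_type", "variant_a"), ("variant_a", "variant_b")}, molecules
)

reflexive = is_reflexive(has_variant, molecules)
symmetric = is_symmetric(has_variant)
transitive = is_transitive(has_variant)
```

In this example the pair (wild_type, variant_b) is absent, so transitivity fails while reflexivity and symmetry hold.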
To formalize a sub-relation of II-has-variant, e.g., II-has-isoform, we note domain and range of the relation as well as basic axioms. In the definition of the GGP relation, we must carefully consider whether the relation holds between all instances of the GGP class, or only one of its subclasses. For example, the definition of GGP-has-isoform between $G_C$ and D is:
The relations GGP-has-recombinant, GGP-has-precursor and GGP-has-modified-protein follow the same pattern.
II-has-mutant is a relation between an instance of a GGP class and a mutant of this instance. The relation II-has-mutant is irreflexive and symmetric, and consequently not transitive. The definition of $G_C$ GGP-has-mutant D is as follows:
II-has-experimental-material relates an instance of a GGP class to experimental material such as an antisense element. The formal characterization is subject to future work and requires integration with ontologies of experiments such as the Ontology of Biomedical Investigations (OBI) [18].