Ontology design patterns to disambiguate relations between genes and gene products in GENIA
© Hoehndorf et al; licensee BioMed Central Ltd. 2011
Published: 6 October 2011
Volume 2 Supplement 5
Annotated reference corpora play an important role in biomedical information extraction. A semantic annotation of the natural language texts in these reference corpora using formal ontologies is challenging due to the inherent ambiguity of natural language. The provision of formal definitions and axioms for semantic annotations offers a means of ensuring consistency and enables the development of verifiable annotation guidelines. Consistent semantic annotations facilitate the automatic discovery of new information through deductive inferences.
We provide a formal characterization of the relations used in the recent GENIA corpus annotations. For this purpose, we both select existing axiom systems based on the desired properties of the relations within the domain and develop new axioms for several relations. To apply this ontology of relations to the semantic annotation of text corpora, we implement two ontology design patterns. In addition, we provide a software application to convert annotated GENIA abstracts into OWL ontologies by combining both the ontology of relations and the design patterns. As a result, the GENIA abstracts become available as OWL ontologies and are amenable to automated verification, deductive inferences and other knowledge-based applications.
Documentation, implementation and examples are available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/.
The goal of Information Extraction (IE) is to recognize specific pieces of information in natural language texts and to represent them in a structured form that comprises meaningful associations of relevant entities. For this reason, IE approaches typically involve Named Entity Recognition (NER) where mentions of specific types of “real-world” entities, such as people or places, are detected in text. To facilitate reliable biomedical IE, considerable efforts have been made with regard to the development of specialized NER methods for key domain entities, focusing in particular on the recognition of gene and gene product (GGP) mentions [1–3]. As GGP mentions can further be normalized to identify specific entries in databases such as UniProt, they provide a connection to entities relevant to biomolecular research and thus a solid basis for domain IE. However, in contrast to the well-defined meaning of the basic entities, the semantics of their associations are often only informally defined.
In biomedical IE, extracted information is frequently represented simply as untyped pairs of entities representing, for instance, protein-protein or gene-disease associations. However, even resources identifying protein-protein interactions as entity pairs diverge considerably in their actual annotations, leading to restrictions ranging from usability to interpretability of both the annotations and IE results. In response to the limitations of such representations, there has recently been increased interest in richer representations of extracted information, and a number of corpora have been published that annotate associations between entities by using fine-grained types drawn from ontologies [7, 8]. Yet, no definition or axiomatization of these relations has been proposed so far. Definitions and axioms are necessary to make the meaning of the relations explicit, and to provide the means for developing consistent and verifiable annotation guidelines allowing for the automatic detection of inconsistent annotations and enabling the discovery of new information through deductive inferences. Here, our aim is to define such relations and axioms for fundamental relations such as part-of connecting GGPs to referents of non-specific domain terms such as promoter region. Annotations to these fundamental relations have been introduced recently [9, 10] to the widely used GENIA corpus.
Providing formal definitions and axioms for these relations is challenging because the relation annotations are based on the use of the relations in text, where it is generally not possible to enforce a common understanding of terms. We extend our preliminary work and present a formal characterization of the relations used in the GENIA relationship annotation based on two ontology design patterns. These patterns are not restricted to an application within the GENIA corpus annotation, but can be applied in a wide range of domains, in particular in ontology- and knowledge-based applications using the categories of biological sequences, DNA, RNA or proteins. We implement the developed formalisms in OWL and provide conversion software to represent annotated GENIA abstracts in OWL.
The GENIA corpus consists of 2,000 PubMed abstracts annotated manually by biomedical domain experts as a resource for the development and evaluation of domain information extraction (IE) methods. GENIA is one of the most widely used corpora for biomedical IE and has served as the basis for two community-wide shared tasks on named entity recognition and event extraction. The annotations of the corpus abstracts include markup that identifies occurrences of domain terms and named entities, as well as statements of events involving these terms and entities [8, 11, 13]. The most recent addition to the corpus annotations covers relations between references to named entities and other domain terms.
An ontology is the formal specification of a conceptualization of a domain. A conceptualization is a system of classes accounting for a particular view on the world. Ontologies are used to specify the meaning of terms within a vocabulary. A basic ontological distinction is made between classes and individuals (or particulars). A class is an entity that can be predicated of other entities and that can have instances. The instance-of relation links instances to the class of which they are an instance. Some instances may be classes themselves and have further instances, while an individual is an entity that cannot be further instantiated.
For the purpose of formalizing the relations used in the GENIA corpus, we make use of several biomedical domain ontologies: the Information Artifact Ontology (http://code.google.com/p/information-artifact-ontology/) (IAO), the Sequence Ontology (SO), the Ontology of Biomedical Investigations (OBI), the Gene Ontology (GO) and the GENIA term ontology.
Relations in biomedical ontologies can be asserted both between classes and between individuals. Relations between individuals are used to define the relations between classes. These definitions may take the form of reusable patterns, and we will create such patterns for relations between classes in GENIA.
The first question we have to answer before we can formalize relations used in corpus annotation is what kind of entities are connected through relations in GENIA. Our first observation is that relations in corpus annotations are usually asserted between names and other biomedical domain terms, i.e., between strings that are identified as referring to some kind of entity. While the description of experiments in scientific publications will commonly refer to collections of individuals and not to classes, the goal of named entity recognition is, among others, the identification of the class to which the characterized collections belong. Therefore, we assume that the names identified in the GENIA corpus denote classes.
In some cases, there is ambiguity in determining the referent of a name or domain term, i.e., certain terms may not refer to identical entities, yet their referents are regarded as indistinguishable within the context of a task such as the annotation or recognition of named entities. Regarding certain referents as indistinguishable can improve the automatic extraction of relations and entities. The indistinguishability assumption also allows the definition of generic relations that hold between disjoint classes. Through these means, the effort to create annotations can be reduced, while the applicability of the relations in different tasks and the feasibility of automatic extraction can be maximized. Within GENIA annotations and the NER systems based on them, genes and gene products are not distinguished.
Therefore, a basic precursor for our work is an equivalence relation which states that, within the context of a named entity annotation task, two classes are considered to be indistinguishable.
Names or terms referring to either a class of genes, DNA, proteins, RNAs and their splice variants, gene products, arbitrary transcripts or similar are considered to be equivalent within the context of the GENIA relation annotations. These classes are called genes/gene products (GGPs). For example, CD19, CD19 protein and CD19 gene may be considered to be equivalent and represent a single GGP.
Such a formalization has the benefit of connecting the different kinds of GGPs through formal relations that can be exploited by an automated reasoner.
For example, the name “CD19 protein” refers to a class of proteins, and instances of this class stand in a translated-from relation to instances of a class of RNA which may be referred to as “CD19 RNA”. Instances of this class of RNA stand in a transcribed-from relation to instances of a class of DNA which may be referred to as “CD19 gene”. Thus, according to our definition, all three classes are subclasses of the GGP class G_CD19.
The class-subclass relation is used to annotate the relation between terms or names in the GENIA corpus where one term refers to a more general class than the other term. For example, this relation holds between the names “CD19 human” (denoting the class CD19 human) and “CD19” (denoting a class that is indistinguishable from the class CD19 (GGP)). We base the definition of the class-subclass relation upon the ontological is-a relation: the classes C and D stand in the is-a relation if and only if every instance of C is also an instance of D.
For example, the referent of the name “human CD19 gene” (the class CD19 human gene) stands in the is-a relation to the referent of the name “CD19” (the GGP class CD19 (GGP)), because all instances of CD19 human gene are also instances of CD19 (GGP).
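The is-a definition given above can be stated in first-order form. The following is a sketch only; the predicate name inst for the instance-of relation is notation assumed here for illustration and is not taken from the original formalization.

```latex
\[
C \;\text{is-a}\; D \iff \forall x \,\bigl(\mathrm{inst}(x, C) \rightarrow \mathrm{inst}(x, D)\bigr)
\]
```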
The largest group of relations in the relationship annotations of the GENIA corpus refers to mereological relations, i.e., relations between parts and their wholes. Three kinds of parthood relations are distinguished within GENIA:
relations between a whole and its components, for example between the classes CD19 promoter and CD19,
relations between a collection and its members, as between Hox gene family and HOXA1,
the relation between an entity and the location at which this entity exists, such as CD19 which is located at CD19 locus.
Substantial work has already been undertaken with regard to mereological relations and their representation in OWL and biomedical ontologies [20, 23, 24]. In particular, the relation CC-part-of, as a relation between classes (we generally prefix relations between two classes with CC-, and relations that hold between two individuals with II-), must be defined in terms of another relation II-part-of which is a relation between individuals [20, 25]. For example, CC-part-of can be defined as C ⊑ ∃II-part-of.D and CC-has-part as C ⊑ ∃II-has-part.D. Although these definitions are valid for many of the parthood relations asserted between classes in biological ontologies, they are inadequate schemata for parthood relations which have a GGP class as argument, because the GGP class is “too general”.
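The class-level definition C ⊑ ∃II-part-of.D can be illustrated on a small finite model. The sketch below is not part of the original formalization: it assumes that a class is represented by the set of its instances and that II-part-of is a set of (part, whole) pairs; all individual names are hypothetical.

```python
# Toy model: class extensions as sets of individuals, and II-part-of
# as a set of (part, whole) pairs. All names here are illustrative.
def cc_part_of(c_instances, d_instances, ii_part_of):
    """True iff every instance of C is II-part-of some instance of D."""
    return all(
        any((x, y) in ii_part_of for y in d_instances)
        for x in c_instances
    )

promoters = {"promoter_1", "promoter_2"}
genes = {"gene_1", "gene_2"}
ii_part_of = {("promoter_1", "gene_1"), ("promoter_2", "gene_2")}

# Every promoter in the model is part of some gene.
holds = cc_part_of(promoters, genes, ii_part_of)

# Adding a promoter without a whole breaks the class-level relation.
fails = cc_part_of(promoters | {"promoter_3"}, genes, ii_part_of)
```

A single instance of C without a corresponding whole is enough to falsify the class-level statement, which is why the choice of the class C matters so much for GGP classes.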
Because a GGP class has several GGP-equivalent subclasses, the CC-has-part and CC-part-of relations may be valid for one of these classes but not for the others. For example, assuming the definition of CC-has-part above, asserting a CC-has-part relation between the GGP class CD19 (GGP) and CD19 promoter would be incorrect, because the GGP class will also include the CD19 protein class, which has no promoter as part (in virtue of being a class of proteins). Similarly, although it would be correct to assert that CD19 promoter CC-part-of CD19, it would be incorrect to say that CD19 CC-part-of CD19/CD21/CD81/Leu-13 complex. If both statements held, we could infer that CD19 promoter is CC-part-of the CD19/CD21/CD81/Leu-13 complex, which is incorrect because protein complexes have no promoters as part.
Intuitively, this definition states that if the GGP class G_C stands in the GGP-subclass-has-part relation to the class X, then either the DNA, RNA or Protein subclass of G_C must stand in a CC-has-part relation to X. Using this pattern, we are further able to define the relation GGP-subclass-part-of by replacing II-has-part with II-part-of in definition 3.
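The intuition described above can be illustrated on a toy model. This is a minimal sketch under stated assumptions, not the authors' definition: classes are represented by sets of hypothetical individuals, the GGP class is a dictionary of its DNA, RNA and Protein subclasses, and II-has-part is a set of (whole, part) pairs.

```python
def cc_has_part(c_instances, x_instances, ii_has_part):
    """True iff every instance of C has some instance of X as part."""
    return all(
        any((c, x) in ii_has_part for x in x_instances)
        for c in c_instances
    )

def ggp_subclass_has_part(ggp_subclasses, x_instances, ii_has_part):
    """True iff at least one subclass of the GGP class is CC-has-part X.

    Subclasses are assumed to be non-empty; an empty extension would
    satisfy cc_has_part vacuously.
    """
    return any(
        cc_has_part(instances, x_instances, ii_has_part)
        for instances in ggp_subclasses.values()
    )

# Hypothetical individuals for the CD19 example.
cd19 = {
    "DNA": {"cd19_gene_1"},
    "RNA": {"cd19_rna_1"},
    "Protein": {"cd19_protein_1"},
}
cd19_promoter = {"cd19_promoter_1"}
ii_has_part = {("cd19_gene_1", "cd19_promoter_1")}

# Plain CC-has-part over the whole GGP class fails: RNA and protein
# instances have no promoter as part.
all_cd19 = set().union(*cd19.values())
plain = cc_has_part(all_cd19, cd19_promoter, ii_has_part)

# The GGP pattern succeeds because the DNA subclass satisfies it.
pattern = ggp_subclass_has_part(cd19, cd19_promoter, ii_has_part)
```

In this model the generic relation is false while the GGP variant is true, which mirrors the CD19 promoter example in the text.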
It is the II-proper-part-of relation which will provide the basis for the mereological relations within GENIA, because identical (or co-extensional) classes are not annotated as standing in a parthood relation. Parthood relations that are not based upon location are further distinguished into two kinds in the GENIA relation annotation: a relation between components and the objects of which they are components, and membership in collections. We assume that the component-object relation (between individuals) II-oc-part-of is similar to the relation of determinate parthood in that it is reflexive, transitive, antisymmetric and satisfies the strong supplementation principle. Assuming these axioms for II-oc-part-of provides compatibility with the SO, which also assumes the axioms of extensional mereology for the entities classified by it [17, 26].
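For reference, the four properties assumed for II-oc-part-of can be written out in their standard first-order form. The abbreviations P for II-oc-part-of and O for overlap are notation assumed here for illustration, with O(x, y) defined as the existence of a common part.

```latex
\begin{align*}
&\forall x\; P(x, x) && \text{(reflexivity)}\\
&\forall x, y, z\; \bigl(P(x, y) \wedge P(y, z) \rightarrow P(x, z)\bigr) && \text{(transitivity)}\\
&\forall x, y\; \bigl(P(x, y) \wedge P(y, x) \rightarrow x = y\bigr) && \text{(antisymmetry)}\\
&\forall x, y\; \bigl(\neg P(y, x) \rightarrow \exists z\, (P(z, y) \wedge \neg O(z, x))\bigr) && \text{(strong supplementation)}\\
&O(x, y) \;:\Longleftrightarrow\; \exists w\, \bigl(P(w, x) \wedge P(w, y)\bigr)
\end{align*}
```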
The member-component relation, on the other hand, is a relation between entities of different kinds and is neither reflexive nor antisymmetric [23, 27]. The II-member-of relation is a sub-relation of the II-proper-part-of relation and is non-reflexive, asymmetric and non-transitive. II-member-of is not the same relation as the member-of relation in the SO; in the SO, member-of is transitive, while II-member-of is non-transitive. The relation GGP-subclass-member-of holds between a GGP class and a collection, such that for one of the subclasses of the GGP class, all instances are a member of some instance of the collection. Therefore, the same pattern as in definition 3 applies for the definition of GGP-member-of. For example, the Lck (GGP) class stands in the GGP-member-of relation to the protein family Src family, because there is a subclass of Lck (GGP), i.e., Lck protein, such that all instances of this subclass stand in an II-member-of relation to some instances of Src family. We do not provide a formal characterization of protein family here, but re-use the class from the GENIA term ontology and represent specific protein families (such as the Src family) as subclasses of GENIA’s Protein family class. A detailed formal characterization of Protein family within GENIA is subject to future work.
The second major group of GENIA corpus relations connects names of GGP classes to names of classes of their variants. Again, we formalize the relations that hold between the classes that are denoted by these names.
The GENIA annotations for GGP classes and their variants use six different relations to express the following relationships:
♦ GGPs to modified proteins, e.g., TR alpha 1 (GGP) to 35S-TR alpha 1 (Protein),
♦ GGPs to protein isoforms, e.g., ACTA1 (Protein) to G-Actin (GGP),
♦ GGPs to mutants, e.g., TNFRI (GGP) to dominant-negative mutant TNFRI (Protein),
♦ GGPs to recombinants, e.g., Oct-2 (GGP) to Oct-2 expression vector (DNA),
♦ GGPs to precursors, e.g., IL-16 (GGP) to pro-IL-16 (Protein),
♦ GGPs to experimental material, in particular to antisense elements, e.g., GATA-3 (GGP) to antisense GATA-3 RNA (RNA).
Again, we provide basic axioms for the II-has-variant relation. Our first observation is that variance is reflexive, i.e., everything (every molecule) is a variant of itself. Furthermore, variance is symmetric, i.e., if x is a variant of y, then y is a variant of x. Whether II-has-variant is transitive is more difficult to ascertain. While it seems to be the case that, if x is a variant of y and y a variant of z, then x is a variant of z, this principle may fail if the distance between x and z increases, i.e., more intermediate variants are introduced. Consequently, we do not assume that II-has-variant is transitive.
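The reason for not assuming transitivity can be illustrated with a toy similarity criterion. The sketch below is an assumption made purely for illustration and is not the definition of II-has-variant: two equal-length sequences count as variants when they differ in at most one position, and the sequences are invented.

```python
def is_variant(a, b, max_diff=1):
    """Toy criterion: equal-length sequences differing in at most
    max_diff positions. Not the paper's definition of II-has-variant."""
    return len(a) == len(b) and sum(p != q for p, q in zip(a, b)) <= max_diff

# Invented sequences: y differs from x in one position, z from y in one.
x, y, z = "MKTA", "MKTG", "MRTG"

reflexive = is_variant(x, x)                       # x is a variant of itself
symmetric = is_variant(x, y) == is_variant(y, x)   # order does not matter
chain = is_variant(x, y) and is_variant(y, z)      # both single steps hold
end_to_end = is_variant(x, z)                      # x and z differ in two positions
```

Under this criterion each single step is a variant relation, but the two ends of the chain are not variants of each other, so the relation is reflexive and symmetric without being transitive.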
The relations GGP-has-recombinant, GGP-has-precursor and GGP-has-modified-protein follow the same pattern.
II-has-experimental-material relates an instance of a GGP class to experimental material such as an antisense element. The formal characterization is subject to future work and requires integration with ontologies of experiments such as the Ontology of Biomedical Investigations (OBI).
The BioTop Ontology is derived from the GENIA term ontology and provides definitions and axioms for the classes in the GENIA ontology. Additionally, this ontology includes several relations. Some of these relations overlap with those used in the GENIA relation annotation and in the relation ontology, in particular the mereological relations. Yet, BioTop includes mostly the generic definitions of mereological relations. Thus, BioTop’s formalization of mereological relations cannot be used with respect to GGP, as their axioms do not always hold for GGPs as shown earlier. Furthermore, the BioTop ontology does not include any of the variance relations. As BioTop provides a rich axiom system for the classes of the GENIA term ontology, we aim at integrating the BioTop ontology with the relation ontology and the design patterns we provide in future work.
Another relevant ontology is the Gene Regulation Ontology (GRO), which is an ontology for the domain of gene regulation. It provides axioms and definitions for the classes DNA, RNA and protein. Furthermore, it establishes relations between these classes. Therefore, it provides a means for a more detailed specification of GGP classes. GRO does not cover the relations formalized in this work. Rather, it could allow a more fine-grained definition of GGP classes if necessary.
There are several applications of formalized relations within the GENIA corpus:
development of unambiguous annotator guidelines,
verification of annotations,
inference of hidden knowledge and
abductive reasoning, inductive logic programming, rule learning.
Firstly, clear annotator guidelines reduce ambiguity and thereby increase inter-annotator consistency. For this purpose, high expressivity is necessary to specify the meanings of relationship terms or other terms as precisely as possible. To proceed towards the goal of unambiguous, formal guidelines for corpus annotation, we used predicate logic for the formalization, and additionally associated our definitions and axioms with explanations in natural language.
Secondly, the axioms provide a means to verify annotations. Such a verification is made possible because axioms restrict the combinations of relations and may lead to contradictions which are sometimes automatically detectable. In particular, the OWL implementation of both the axioms and the ontology design patterns is amenable to automated reasoning and can be used to detect inconsistencies.
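As an illustration of how axioms can expose contradictory annotations, the sketch below checks a list of member-of assertions against the non-reflexivity and asymmetry stated for II-member-of. It is a simplification and not the OWL-based verification described here: annotated name pairs are treated as direct assertions, and the annotation list is invented toy input, not data from the GENIA corpus.

```python
def member_of_violations(pairs):
    """Return (a, b, reason) for pairs that contradict the assumption
    that member-of is irreflexive and asymmetric."""
    asserted = set(pairs)
    violations = []
    for a, b in pairs:
        if a == b:
            violations.append((a, b, "irreflexivity"))
        elif (b, a) in asserted:
            violations.append((a, b, "asymmetry"))
    return violations

# Invented toy annotations; the second and fourth are deliberate errors.
annotations = [
    ("Lck", "Src family"),
    ("Src family", "Lck"),
    ("HOXA1", "Hox gene family"),
    ("CD19", "CD19"),
]

found = member_of_violations(annotations)
```

Both directions of a symmetric pair are reported, since either assertion could be the erroneous one; deciding which to retract is left to the annotator.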
Additionally, it is possible to draw inferences from the asserted knowledge automatically. These inferences can be used to verify whether or not erroneous annotations have been asserted by identifying undesired or false inferences. Moreover, automatic inferences can be used to infer hidden or new knowledge.
The conversion tool we provide converts annotated GENIA abstracts into an OWL ontology. This conversion is a form of ontology induction or ontology generation. The resulting ontologies – each covering a domain described within one abstract – can be used for abductive or inductive logic programming, rule learning or other knowledge-based machine learning techniques.
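To give a rough idea of what such a conversion might emit, the sketch below writes two axioms in OWL 2 functional-style syntax as plain strings. It is not the authors' tool and does not reproduce its output: the helper functions, the default prefix and the class names Lck_protein, Lck_GGP and Src_family are hypothetical names chosen for this example.

```python
def subclass_of(sub, sup):
    """Axiom stating that every instance of sub is an instance of sup."""
    return f"SubClassOf(:{sub} :{sup})"

def subclass_of_some(sub, prop, filler):
    """Axiom stating that every instance of sub is related via prop
    to some instance of filler (an existential restriction)."""
    return f"SubClassOf(:{sub} ObjectSomeValuesFrom(:{prop} :{filler}))"

# Hypothetical identifiers for the Lck / Src family example.
axioms = [
    subclass_of("Lck_protein", "Lck_GGP"),
    subclass_of_some("Lck_protein", "II-member-of", "Src_family"),
]

# One axiom per line, ready to be placed inside an Ontology(...) block.
ontology_body = "\n".join(axioms)
```

Generating axioms as text keeps the example free of external dependencies; a real converter would more likely use an OWL library to build and serialize the ontology.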
To provide definitions for the relations between classes that are used in the GENIA corpus, we developed two closely related ontology design patterns. They are particularly suited for applications in text mining where the exact referent of a term cannot always be reliably determined. However, the patterns could be useful in other domains and applications as well.
The first ontology design pattern is applicable when a class C with the subclasses D1, ..., Dn stands in a relation CC-R to a class E such that every instance of at least one subclass of C stands in a relation II-R to some instance of E. This pattern is useful when one class cannot be entirely disambiguated, and a superclass is used in a relation statement instead. For example, GGP classes in GENIA are primarily introduced because it is not always possible – or reasonable – to disambiguate entirely whether a term refers to DNA, RNA or Protein classes. Instead, the GGP class is used in relation statements, and the GGP class unifies the classes of DNA, RNA and Protein. In many cases, the relation is only relevant for the instances of one of the subclasses, e.g. only the Proteins, such that some property or relation applies to every instance of this subclass but not to all the instances of the other subclasses.
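Read literally, the first pattern can be sketched in first-order form as follows. This is one possible rendering of the prose description and should not be taken as the authors' formal definition; the predicate inst for instance-of is notation assumed here for illustration.

```latex
\[
\text{CC-}R(C, E) \iff \bigvee_{i=1}^{n} \forall x\,\Bigl(\mathrm{inst}(x, D_i) \rightarrow \exists y\,\bigl(\mathrm{inst}(y, E) \wedge \text{II-}R(x, y)\bigr)\Bigr)
\]
```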
In general, it is possible to consider either an order defined on the relations T1, ..., Tm or arbitrary permutations. Intuitively, the pattern is used to state that all instances of one general class (the GGP class in the case of GGP annotations) stand in a relation II-S to some instance of a class D or to any entity reachable by a chain (or permutation) of the relations T1, ..., Tm from any instance of this class.
Although the formalization of relationships used in the GENIA annotation is itself valuable to provide a means for automated inference and verification as well as the development of annotation guidelines, formalized relations will be much more useful in combination with a formal characterization of events. Events include more dynamic entities such as the binding of a molecule to a binding site. In conjunction with the formalization of the relations, more useful inferences would become possible. For example, from the assertion that a class X binds Y which is GGP-part-of Z, we would be able to infer that X GGP-binds Z.
We propose ontology design patterns that are not limited to relations between GGPs but can be applied in many domains. For example, the patterns can be used to formally distinguish between functions and the processes that realize them when using the functional abnormality pattern [33, 34]. We intend to explore further areas of application beyond the domain of genes and their products.
We presented and discussed a formal ontology-based characterization of the relations used for annotating the GENIA corpus. The main challenge was the ambiguity of the terms upon which the relations are based. These terms refer to one of several ontological classes, and the definitions of the relationships between two terms had to reflect that only one of these classes can stand in some relation to another class. To characterize this phenomenon formally, we introduced the notion of a GGP class, which is an ontological class with subclasses whose names are not distinguishable within a certain annotation task. In our GENIA use case, the GGP class is a common superclass for classes of DNA, RNA and proteins, and is intended to unify classes of genes and their products.
We introduced two ontology design patterns to formally define relations that hold between a GGP class and another class. The ontology design patterns are especially useful whenever it is not possible – or not feasible – to determine the exact class that stands in some relation to another class, and a more general class is chosen in a relation statement instead. Therefore, they can be generalized to other domains and applications besides corpus annotation.
We implemented the axioms and definitions as well as the ontology design patterns in a software application that converts annotated GENIA abstracts into OWL ontologies. These ontologies can then be used to answer queries, verify annotations or provide a basis for knowledge-based machine learning techniques. Formalizing the relations used in the relationship annotations of the GENIA corpus provides a powerful means to verify the annotations, to reason over them and to establish and communicate unambiguous and precise annotation guidelines. The ontology of relations, its axioms and our ontology design patterns are applicable and useful beyond GENIA. They can be integrated in other ontology- or knowledge-based resources whenever two classes are considered to be indistinguishable and need to be disambiguated through automated reasoning.
The research work in its first unrevised form was presented at SMBM 2010, Hinxton, Cambridge, U.K.
This article has been published as part of Journal of Biomedical Semantics Volume 2 Supplement 5, 2011: Proceedings of the Fourth International Symposium on Semantic Mining in Biomedicine (SMBM). The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/2/S5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.