Integration and publication of heterogeneous text-mined relationships on the Semantic Web
© Coulet et al; licensee BioMed Central Ltd. 2011
Published: 17 May 2011
Volume 2 Supplement 2
Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and complexity of natural language in expressing similar relationships cause the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using them for data mining or question answering.
We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network.
The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at http://purl.bioontology.org/ontology/PHARE.
A large amount of biomedical knowledge is in the form of text embedded in published articles, clinical files or biomedical public databases. In order to construct computable knowledge bases from these sources, there is great interest in capturing and formalizing this knowledge. The capture of relationships between biological entities is of particular interest since such relationships represent elementary and reusable knowledge units, often called “nano-publications”.
Our work is motivated by the need for automated approaches capturing and formalizing knowledge extracted from the literature via manual or computational approaches. Consider, for example, that five curators at the Pharmacogenomics Knowledge Base (PharmGKB) manually browse the pharmacogenomics (PGx) literature to curate relationships relevant for storage in the PharmGKB. The result of this curation process is a high-quality database queried by clinicians and bioinformaticians. Nevertheless, this manual curation process is not sustainable considering the growth of the scientific literature in this domain. Automatic approaches using Natural Language Processing (NLP) are therefore increasingly utilized.
The simplest methods to capture relationships rely on co-occurrence of two entities to derive a relation between them. For example, in the sentence “Our study shows that warfarin inhibits the expression of VKORC1” a drug, warfarin, and a gene, VKORC1, can be recognized using simple lexicons. The co-occurrence of these two entities in one or more sentences is used to derive a relation of the form (warfarin, VKORC1).
One key limitation of the co-occurrence-based approach is the identification of false-positive connections. For example, the sentence “Warfarin inhibits the expression of VKORC1 while sulfamethoxazole inhibits the expression of CYP2C9” would provide co-occurrence counts towards four relationships, including (warfarin, VKORC1) and (warfarin, CYP2C9), of which only the first is actually stated in the sentence. A second limitation is the coarse granularity of the identified relationships. In the previous example, the mentioned relationship links warfarin and the expression of VKORC1, not VKORC1 per se. We consider this distinction important since VKORC1 and expression of VKORC1 refer to a gene and a phenotype respectively, two very distinct entities. Despite these limitations, co-occurrence is successfully used to generate networks including protein-protein interaction networks, gene-disease networks and regulatory gene expression networks [5, 6]. Most of these networks are hard to compute on, since their representation format does not support queries with typed relationships and the semantics associated with the nodes and edges differ in every network.
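The co-occurrence approach and its false-positive problem can be illustrated with a minimal sketch. The two lexicons below are toy stand-ins assumed for illustration, and the tokenization is deliberately naive; this is not the extraction code used in the study.

```python
from itertools import product

# Toy lexicons (assumed for illustration; real lexicons are far larger).
DRUGS = {"warfarin", "sulfamethoxazole"}
GENES = {"vkorc1", "cyp2c9"}

def cooccurrence_pairs(sentence):
    """Return every (drug, gene) pair whose members both appear in the sentence."""
    tokens = [t.strip(".,;()").lower() for t in sentence.split()]
    drugs = [t for t in tokens if t in DRUGS]
    genes = [t for t in tokens if t in GENES]
    return sorted(set(product(drugs, genes)))

simple = "Our study shows that warfarin inhibits the expression of VKORC1"
mixed = ("Warfarin inhibits the expression of VKORC1 while "
         "sulfamethoxazole inhibits the expression of CYP2C9")

print(cooccurrence_pairs(simple))  # a single pair
print(cooccurrence_pairs(mixed))   # four pairs, two of them spurious
```

The second sentence yields (warfarin, cyp2c9) and (sulfamethoxazole, vkorc1) even though neither relationship is stated, and neither output says anything about the type of relationship or about the fact that the object is an expression rather than a gene.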
In previous work, we described the extraction of over 40,000 raw relationships in the domain of pharmacogenomics from MEDLINE abstracts. In the following sections we briefly summarize this extraction process and then describe how we use the PHARE ontology we have created to normalize and integrate these relationships.
We define a relationship as a binary relation R(a, b), where a and b are the subject and object related by a relationship of type R. In PGx relationships, a and b can be instances of a gene (e.g., VKORC1 gene), drug (e.g., warfarin), or phenotype (e.g., clotting disorder). We note that a and b can also be entities that are related to genes (e.g., VKORC1 expression), drugs (e.g., warfarin dose) or phenotypes (e.g., clotting disorder treatment). R is a type of relation described by words such as “inhibits”, “transports”, or “treats” and their synonyms.
Given the definition of PGx relationships, a sentence that potentially contains a PGx relationship would mention a gene and a drug, a gene and a phenotype, or a drug and a phenotype. We used a Lucene index created on individual sentences of MEDLINE abstracts published before 2009 (17,396,436 abstracts and 87,806,828 sentences) processed by Xu et al. to identify those sentences that might contain a PGx relationship [12, 13]. To select only sentences that potentially mention a PGx relationship, we queried the index with pairs of key PGx entities (only gene-drug and gene-phenotype pairs) for sentences that are indexed with both terms in the query. The PharmGKB lexicon provides the sets of synonyms used to build such queries for the key entities. Overall, for this study we used 41 genes highlighted by PharmGKB as key, well-characterized pharmacogenomic genes, as well as 3,007 drugs and 4,202 phenotypes. Future work will expand the relationship extraction to all genes.
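As a rough sketch of how such pairwise queries could be assembled from synonym sets, the snippet below builds boolean query strings in the style of the classic Lucene query syntax (quoted phrases combined with OR and AND). The synonym dictionaries and the helper names `any_of` and `pair_queries` are hypothetical; the actual index schema and query code are not described in the text.

```python
from itertools import product

# Hypothetical synonym sets; the study draws them from the PharmGKB lexicon.
GENE_SYNONYMS = {"VKORC1": ["VKORC1", "vitamin K epoxide reductase complex subunit 1"]}
DRUG_SYNONYMS = {"warfarin": ["warfarin", "coumadin"]}

def any_of(terms):
    """OR together quoted phrases, e.g. ("a" OR "b")."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def pair_queries(left, right):
    """Yield one boolean query per (left entity, right entity) pair."""
    for (l_name, l_syn), (r_name, r_syn) in product(left.items(), right.items()):
        yield (l_name, r_name), f"{any_of(l_syn)} AND {any_of(r_syn)}"

queries = dict(pair_queries(GENE_SYNONYMS, DRUG_SYNONYMS))
print(queries[("VKORC1", "warfarin")])
```

A sentence matches such a query only if it mentions at least one synonym of each entity, which is the selection criterion described above.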
Sentences returned by the index are parsed using the Stanford Parser to build Dependency Graphs (DGs). DGs are rooted, directed, and labelled graphs, where nodes are words and edges are dependency relations between words (e.g., noun modifier, nominal subject). The extraction of raw relationships of the form R(a, b) relies on the exploration of the syntactic structure provided by DGs, where:
R is a node in the DG that connects a and b, and indicates the nature of their relationship.
The entity hierarchy is defined with the subsumption relation (noted as ⊑ or subClassOf in OWL). Existential quantification is used to define sets of composite entities that are only modified by certain concepts. For example, the set of entities that are modified by drugs is defined with the existential quantifier (∃) and the role modified by: ∃ modified.Drug (or modified some Drug in Manchester OWL syntax, which corresponds to an owl:someValuesFrom restriction); see Figure 4 for examples. This definition is associated through a subsumption relation to entities that can be modified by drugs, such as DrugSensitivity. This pattern is used to distinguish what is specialized (or modified) by drugs from what is specialized by other modifiers (e.g., disease names). For example, knowing that warfarin is a drug enables us to distinguish warfarin sensitivity from cancer sensitivity and to classify warfarin sensitivity as a kind of drug sensitivity rather than disease sensitivity (represented by the DiseaseSensitivity concept).
Class declarations are used to list all key entities of the domain of interest and what entity type they belong to. In our case, where gene-drug relationships are studied, known drugs and genes must be defined in the ontology as being an instance of the entity types Drug and Gene.
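The effect of this pattern can be approximated with a small lookup, shown here as a sketch only. The two dictionaries are toy stand-ins for the class declarations and the existential restrictions (they are assumptions, not PHARE content), and a real OWL reasoner performs this classification from the axioms rather than from hand-written tables.

```python
# Toy stand-ins for ontology content (assumed for illustration only).
ENTITY_TYPE = {"warfarin": "Drug", "cancer": "Disease"}   # class declarations
COMPOSITE_BY_MODIFIER = {                                 # head term -> modifier type -> concept
    "sensitivity": {"Drug": "DrugSensitivity", "Disease": "DiseaseSensitivity"},
}

def classify_composite(modifier, head):
    """Pick the composite concept whose modifier type matches the declared type."""
    modifier_type = ENTITY_TYPE.get(modifier.lower())
    return COMPOSITE_BY_MODIFIER.get(head.lower(), {}).get(modifier_type)

print(classify_composite("warfarin", "sensitivity"))
print(classify_composite("cancer", "sensitivity"))
```

An undeclared modifier returns `None`, which mirrors why known drugs and genes have to be declared in the ontology before composite entities can be classified.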
In order to quantify the utility of manual review and editing of the raw relationships in building PHARE, we built a second ontology named WN-PHARE in a purely automated manner using the lexical resource WordNet. In this case all relationship types (and not just the 200 most frequent ones) are computationally merged into groups according to WordNet synsets. The resulting groups are directly used to define roles without any manual review. Similarly, all terms that modify gene, drug or phenotype names are merged into groups used to define composite entities.
The algorithm to normalize typed relationships between composite entities consists of four steps. The first three steps normalize the subject entity, the object entity, and the relationship type. The last step assembles the three normalized pieces into a normalized relationship of the kind shown in Figure 1.
The next step is to normalize the relationship type. The ontology is searched for role labels that match the raw relationship. When a match is found, the preferred name of the corresponding role is used to normalize the relationship type. Note that during this step the normalization process distinguishes between passive voice of the present tense, such as “A is inhibited by B” and active voice of simple past tense “B inhibited A”. Dependency Graphs of these two sentences are different because “inhibited” in the passive voice sentence is related through an aux dependency to “is” (standing for auxiliary). This difference is used during the relationship extraction to extract either is$$inhibited(A, B) or inhibited(A, B).
The final step is to group together normalized composite entities and relationship type to produce normalized relationships. For each relationship, this step relies on the simple assembly of normalized type, subject and object. In addition if the role used to normalize the type has inverses or is symmetric then this step also creates the appropriate additional relationships. For each inverse role in the ontology, an inverse relationship is created with the preferred name of the inverse and where normalized subject and object are swapped. If the role is symmetric, one additional relationship is created with the same normalized relationship type but with subject and object swapped. Figure 5 illustrates the integration process that applies such relationship normalization on four heterogeneous sentences.
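The last two steps (normalizing the relationship type, then assembling the result together with inverse or symmetric counterparts) might be sketched as follows. The `Role` class, the label sets and the `normalize` function are hypothetical illustrations rather than PHARE content, and the subject and object are assumed to have been normalized already.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Role:
    preferred: str                 # preferred name used after normalization
    labels: frozenset              # raw relationship types mapped to this role
    inverse: Optional[str] = None  # preferred name of the inverse role, if any
    symmetric: bool = False

# Hypothetical excerpt of roles; the labels are illustrative only.
ROLES = [
    Role("inhibits", frozenset({"inhibits", "inhibited", "blocks"}), inverse="isInhibitedBy"),
    Role("isInhibitedBy", frozenset({"is$$inhibited"}), inverse="inhibits"),
    Role("isAssociatedWith", frozenset({"is$$associated", "is$$linked"}), symmetric=True),
]
LABEL_TO_ROLE = {label: role for role in ROLES for label in role.labels}

def normalize(raw_type, subject, obj):
    """Return the normalized relationship plus any inverse or symmetric counterpart."""
    role = LABEL_TO_ROLE.get(raw_type)
    if role is None:
        return []  # no matching label: the relationship stays un-normalized
    triples = [(role.preferred, subject, obj)]
    if role.inverse:
        triples.append((role.inverse, obj, subject))    # swap subject and object
    if role.symmetric:
        triples.append((role.preferred, obj, subject))  # same role, swapped arguments
    return triples

print(normalize("is$$inhibited", "VKORC1 expression", "warfarin"))
```

Keeping passive forms such as `is$$inhibited` as labels of a distinct role is what lets the active and passive sentences end up as the same pair of normalized relationships.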
Applying the normalization on raw relationships produces a set of relationships represented as PHARE entities and roles. Consequently normalized relationships can be directly added to PHARE as instances to create a knowledge base.
Raw relationships have been normalized twice using PHARE to iteratively refine the ontology. After the first iteration of the normalization, from the pool of un-normalized relationships we manually identify terms and roles that are either frequent or of PGx interest. Such terms (or roles) are then used to extend the set of synonyms of an entity already defined in the ontology, or used to create a new entity in the ontology.
The PHArmacogenomic RElationship ontology (or PHARE) contains 229 entity classes and 76 roles of interest in the PGx domain. PHARE is encoded in OWL-DL and is constructed semi-automatically by (i) listing terms derived from relationships extracted automatically from text; and (ii) the manual organization of the relationship terms by domain experts. Figures 2 and 3 illustrate how the extracted terms are organized in these hierarchies. The PHARE ontology is available online at http://purl.bioontology.org/ontology/PHARE.
The ontology-driven integration process described in the method section takes as input a set of relationships extracted from MEDLINE abstracts and outputs a set of normalized relationships of the form Role(subject, object) represented using entity types and roles defined in PHARE. Therefore, normalized relationships can be used to instantiate roles defined in PHARE without additional processing. We performed such instantiation and obtained the PHARE-Knowledge Base (or PHARE-KB), which contains 28,676 role instantiations encoded as RDF triples from over 41,000 raw relationships. If we consider instantiation of role inverses (e.g., isInhibitedBy(a, b) ≡ inhibits⁻¹(a, b), i.e., inhibits(b, a)), the number of role instantiations rises to 46,526. Note that some roles in PHARE do not have an inverse or are symmetric (e.g., isAssociatedWith).
Almost 77% of role instantiations use roles initially encoded in PHARE and 23% necessitate the creation of new roles in PHARE. In other words, PHARE roles are sufficiently detailed to capture 77% of the relationships we extracted from text analysis. New roles correspond to types of relationships that are not frequent enough in our corpus and consequently have not yet been manually reviewed and defined in PHARE. These roles, which are added solely to instantiate the 23% of un-normalized relationships, are associated with only one label and thus do not yet contribute to the integration of relationships.
The 28,676 role instances link roughly 16,000 individuals of the KB, including 285 genes, 1,083 drugs and 990 diseases. To facilitate overlap comparisons of PHARE-KB with other data sources, individuals of type gene, drug, or disease are associated with their Entrez Gene, DrugBank, and MeSH identifiers respectively.
Individuals in the PHARE-KB can be classified using reasoning. Classification allows us to make the implicit knowledge units explicit. For example, classification infers that
Phenotype(VKORC1 expression)
i.e., VKORC1 expression is a phenotype
on the basis of the following two axioms
GeneExpression(VKORC1 expression)
GeneExpression ⊑ Phenotype
i.e., VKORC1 expression is a gene expression and gene expression is a phenotype.
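The inference in this example amounts to following the subsumption hierarchy upward from an asserted type. The sketch below does only that, with illustrative class names taken from the example; an OWL-DL reasoner handles far more (existential restrictions, inverses, multiple parents), so this is a simplification rather than the classification procedure used for PHARE-KB.

```python
# Toy axioms mirroring the example (class names are illustrative).
SUBCLASS_OF = {"GeneExpression": "Phenotype"}            # gene expression is a phenotype
ASSERTED_TYPE = {"VKORC1 expression": "GeneExpression"}  # asserted class membership

def inferred_types(individual):
    """Return the asserted type followed by every ancestor reachable via subClassOf."""
    types = []
    current = ASSERTED_TYPE.get(individual)
    while current is not None and current not in types:
        types.append(current)
        current = SUBCLASS_OF.get(current)
    return types

print(inferred_types("VKORC1 expression"))
```

The second element of the result is the implicit fact that classification makes explicit.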
Every relationship available in the PHARE-KB (in the form of an RDF triple) is associated with its provenance using the property rdfs:comment. For example, the triple isAssociatedWith(UCHL1, parkinson disease) is associated with the following string: “[14522054, Neuronal ubiquitin C-terminal hydrolase (UCH-L1) has been linked to Parkinson's disease (PD), the progression of certain nonneuronal tumors, and neuropathic pain]”, where 14522054 is the PMID (PubMed ID) of the article and the text is the sentence from which the triple was created.
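A consumer of the KB could recover the PMID and the source sentence from such a comment with a small parser. The sketch below assumes that every comment follows the exact “[PMID, sentence]” layout of the example; the function name `parse_provenance` is hypothetical.

```python
import re

# '[', digits (the PMID), ',', optional whitespace, the sentence, closing ']'.
PROVENANCE = re.compile(r"^\[(\d+),\s*(.*)\]$", re.DOTALL)

def parse_provenance(comment):
    """Split a '[PMID, sentence]' comment into (pmid, sentence); None if malformed."""
    match = PROVENANCE.match(comment.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

comment = ("[14522054, Neuronal ubiquitin C-terminal hydrolase (UCH-L1) has been "
           "linked to Parkinson's disease (PD), the progression of certain "
           "nonneuronal tumors, and neuropathic pain]")
pmid, sentence = parse_provenance(comment)
print(pmid)
print(sentence)
```

Packing both values into one string keeps the triple store simple but forces this kind of parsing on every client, which is one reason a structured provenance representation may be preferable.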
Table: Comparison of PHARE and WN-PHARE (rows: number of entity types, number of roles, labels per entity type, labels per role).
Table: Comparison of the identification of similar relationships. Columns: raw relationships (no normalization); relationships normalized with each ontology. Rows: number of relationships identified n times, for 2 ≤ n < 5, 5 ≤ n < 10, and n ≥ 10.
In order to publish the PHARE-KB for use on the Semantic Web, we set up a SPARQL endpoint, which is available at http://sparql.bioontology.org/webui/. Examples of queries are provided as additional file 1.
The KB is classified and the inferred triples are materialized before loading into the triple store underlying the SPARQL endpoint. As a consequence, queries return asserted as well as inferred facts.
An example query for entities related to the uchl1 gene is shown below:
SELECT $y $z
WHERE { <http://www.stanford.edu/~coulet/phare.owl#uchl1> $y $z . }
This query returns the RDF triple isAssociatedWith(UCHL1, parkinson disease) mentioned previously. Queries can also return sets of RDF triples that are used to build sub-networks related to a specific disease, as shown in Figure 7.
Figures 7 and 8 show gene-disease sub-networks related to Alzheimer's disease (AD) and Parkinson's disease (PD) respectively. For display purposes, these have been reduced by selecting only those nodes that are asserted to be related in more than 5 different sentences. Since the type of relationship differs between sentences, only the two most frequent relationships are displayed as labels on the edges. Each network was obtained using a SPARQL query to select triples where the disease (AD or PD) is either subject or object. The resulting set of triples is then filtered to keep the frequent relationships. Such filtering enables us to remove both false positives and irrelevant triples such as phare:alzheimer=disease rdf:type phare:Disease. Note that in RDF we use the symbol ‘=’ as a simple separator to replace spaces in compound nouns.
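The filtering step can be sketched as a simple aggregation over query results. The input layout (role, subject, object, sentence identifier) and the function name `frequent_edges` are assumptions made for illustration, and the evidence below is made-up placeholder data, not content of PHARE-KB.

```python
from collections import Counter, defaultdict

def frequent_edges(evidence, min_sentences=5, top_labels=2):
    """Keep node pairs supported by more than `min_sentences` distinct sentences.

    `evidence` is an iterable of (role, subject, object, sentence_id) tuples.
    Returns {(subject, object): [most frequent role names]}.
    """
    sentences = defaultdict(set)
    roles = defaultdict(Counter)
    for role, subj, obj, sentence_id in evidence:
        pair = (subj, obj)
        sentences[pair].add(sentence_id)
        roles[pair][role] += 1
    return {
        pair: [name for name, _ in roles[pair].most_common(top_labels)]
        for pair, ids in sentences.items()
        if len(ids) > min_sentences
    }

# Placeholder evidence: six sentences for one pair, a single sentence for another.
evidence = [("isAssociatedWith", "geneA", "diseaseX", f"s{i}") for i in range(4)]
evidence += [("influences", "geneA", "diseaseX", f"s{i}") for i in range(4, 6)]
evidence += [("isAssociatedWith", "geneB", "diseaseX", "s9")]
print(frequent_edges(evidence))
```

Pairs are treated as ordered here; merging (a, b) with (b, a) before counting would be a reasonable variation when the disease may appear as either subject or object.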
Our work is motivated by the need for automated approaches capturing and formalizing knowledge extracted from the literature and the need for publishing such knowledge on the Semantic Web. Recent advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and complexity of natural language in expressing similar or simple relationships cause the extracted relationships to be highly heterogeneous. We show that the use of a relationship ontology can normalize and integrate the heterogeneous relationships extracted from text and serve as a common semantic framework to integrate text-mining derived facts into a knowledge base. However, the manual construction of a relationship ontology is a slow and expensive process. We have devised a method to construct such an ontology using the text-extracted heterogeneous relationships as a starting point. Although we only report on our experiments in the pharmacogenomics domain, we note that the approach described here can be applied to relationship extraction in other domains.
Our results in publishing RDF triples extracted from text align closely with the objectives of the Linking Open Data community project and those of efforts such as the Concept Web Alliance. The goal of projects such as Linked Open Data is to publish various data sets as RDF on the Web and to declare links between data items from different data sources.
Currently, the relationships we extract do not integrate easily with content in the Linked Data Cloud for two main reasons: the lack of unique resource identifiers and the lack of an agreed-upon relation ontology. Despite community efforts to create unique resource identifiers for life sciences, there is currently no clear consensus [21, 22]. In addition, composite entities that participate in relationships, such as VKORC1 expression, are too complex to reference using a single identifier. Moreover, the absence of an expressive and comprehensive relation ontology led us to develop our own in a boot-strapped manner from example instances of text-mined relationships. PHARE is designed for the purpose of representing PGx relationships and we anticipate that sharing it with the community will provide a much-needed example set for the development of a proper, formal biomedical relation ontology. PHARE is particularly suited to seed that activity, because it is built from the most frequent relationships that are used in the scientific literature. One challenge is thus to propose consistent mappings between relationship types arising from the literature, such as those suggested by PHARE, and relationship types arising from functional annotations, such as “suppresses gene” or “enhances gene” suggested by TAIR relations or the Gene Ontology.
Adequately representing provenance information at the sentence level is a challenge. Currently, we utilize the rdfs:comment property to store provenance for each extracted fact in PHARE-KB. In the future, we plan to evaluate the Annotation Ontology developed by Ciccarese et al. for its utility in representing provenance at the sentence level, particularly in workflows where both automated and manual approaches are used simultaneously.
Another limitation is the inconsistency of gene name identifiers across data sources. Our gene identifiers are based on PharmGKB gene names that are not entirely consistent with the HUGO Gene nomenclature, making cross-referencing with other sources time consuming. In a similar vein, recall for extracted relations may improve with more advanced Named Entity Recognition, such as disambiguation techniques, rather than the current PharmGKB-derived dictionary-based approach.
The efficacy of the relationship normalization and integration might vary depending on the source of the text such as full articles, clinical reports, clinical files or drug labels. However, because PHARE has been designed using MEDLINE abstracts, it may capture relationships mentioned in diverse sources.
We have described the construction of an ontology of relationships in the PGx domain and its use to integrate heterogeneous relationships extracted by text-mining. The synonyms, entity descriptions, and the hierarchies of entities and roles represented in the ontology are used to map text-derived relationships to the ontology. Once mapped, relationships can be normalized and compared using the semantics defined in the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast a fully automated and a manually edited version of the PHARE ontology to quantify the degree of integration enabled by manual inspection, curation and refinement of the PHARE ontology. PHARE has been successfully used in a pipeline for the integration of pharmacogenomic relationships extracted from MEDLINE abstracts. The result of the integration is compiled into a knowledge base named PHARE-KB, which can now be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. PHARE-KB can also be queried programmatically, for example, to guide computational prediction of molecular interactions.
NER: Named Entity Recognition
NLP: Natural Language Processing
OWL: Web Ontology Language
OWL-DL: Web Ontology Language, Description Logic kind
RDF: Resource Description Framework
SPARQL: SPARQL Protocol And RDF Query Language.
This work was supported in part by the National Center for Biomedical Ontologies, under roadmap-initiative grant (U54HG004028) from the NIH and by the PharmGKB (GM61374), with computing cluster support from the NSF (CNS-0619926).
This article has been published as part of Journal of Biomedical Semantics Volume 2 Supplement 2, 2011: Proceedings of the Bio-Ontologies Special Interest Group Meeting 2010. The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/2/S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.