Development and application of an interaction network ontology for literature mining of vaccine-associated gene-gene interactions

Background Literature mining of gene-gene interactions has been enhanced by ontology-based name classifications. However, in biomedical literature mining, interaction keywords have not been carefully studied and used beyond a collection of keywords. Methods In this study, we report the development of a new Interaction Network Ontology (INO) that classifies >800 interaction keywords and incorporates interaction terms from the PSI Molecular Interactions (PSI-MI) and Gene Ontology (GO). Using INO-based literature mining results, a modified Fisher’s exact test was established to analyze significantly over- and under-represented enriched gene-gene interaction types within a specific area. Such a strategy was applied to study the vaccine-mediated gene-gene interactions using all PubMed abstracts. The Vaccine Ontology (VO) and INO were used to support the retrieval of vaccine terms and interaction keywords from the literature. Results INO is aligned with the Basic Formal Ontology (BFO) and imports terms from 10 other existing ontologies. Current INO includes 540 terms. In terms of interaction-related terms, INO imports and aligns PSI-MI and GO interaction terms and includes over 100 newly generated ontology terms with ‘INO_’ prefix. A new annotation property, ‘has literature mining keywords’, was generated to allow the listing of different keywords mapping to the interaction types in INO. Using all PubMed documents published as of 12/31/2013, approximately 266,000 vaccine-associated documents were identified, and a total of 6,116 gene-pairs were associated with at least one INO term. Out of 78 INO interaction terms associated with at least five gene-pairs of the vaccine-associated sub-network, 14 terms were significantly over-represented (i.e., more frequently used) and 17 under-represented based on our modified Fisher’s exact test. These over-represented and under-represented terms share some common top-level terms but are distinct at the bottom levels of the INO hierarchy. The analysis of these interaction types and their associated gene-gene pairs uncovered many scientific insights. Conclusions INO provides a novel approach for defining hierarchical interaction types and related keywords for literature mining. The ontology-based literature mining, in combination with an INO-based statistical interaction enrichment test, provides a new platform for efficient mining and analysis of topic-specific gene interaction networks.


Background
Two common strategies of literature retrieval of reported gene-gene interactions include gene-gene cooccurrence and interaction keywords-based literature mining. In this paper, the gene-gene interaction represents a broad interactive relation between two genes or gene products [1]. Such a relation does not have to be a direct physical interaction. The co-occurrence strategy identifies two related genes both listed in the same literature, or more specifically in the same title, abstract, or sentence. An example of such a strategy is PubGene, which extracts gene relationships based on the co-occurrence of gene symbols in MEDLINE titles and abstracts [2]. The other strategy relies on the identification of two genes together with an interaction keyword in the same sentence. Such a method may still generate many false-positive results. To improve the interaction keyword-based approach, machine learning algorithms (e.g., support vector machine (SVM) [3]) with features extracted from syntactic analysis of sentences (e.g., dependency parse trees) can be used [4].
Ontologies can be applied to enhance literature mining performance. For example, in our previous work, a vaccine-specific sub-network was built by considering only the interactions that were extracted from sentences that contain the "vaccine" term (or its variants like "vaccines", "vaccination", and "vaccinated"). This strategy does not retrieve the sentences where more specific vaccine names such as BCG (a commercial tuberculosis vaccine) are mentioned. Such vaccine names and their hierarchical relations are represented in Vaccine Ontology (VO) [5]. We found that the application of VO has significantly improved the analysis of the vaccine-specific sub-networks [6].
An ontology that logically represents various interaction keywords/types and their semantic relations would help address the challenge of retrieving and classifying the types of gene-gene interactions in the interaction keyword-based literature mining. The GENIA ontology provides a semantically annotated corpus for biological literature mining [7]. However, this ontology does not specify various types of interactions between genes or proteins. Initiated from the classification of >800 interaction keywords [6], we have developed the Interaction Network Ontology (INO) that ontologically represents various interaction types and their relations, and collects and assigns interaction keywords to these different interaction types. The details about the ontology will, for the first time, be provided in this manuscript.
In addition to supporting the literature mining of gene-gene interactions, INO can be used for interaction type enrichment analysis. Gene Ontology (GO)-based gene set enrichment analyses have been widely used to determine over-or under-represented biological functions in a set of genes obtained from high-throughput Omics studies. GO provides controlled vocabulary of standard terms for describing gene product characteristics in a hierarchical structure. The input to the GO term enrichment analysis is a list of genes. Such a method does not classify enriched gene-gene interactions. Since INO classifies different interaction types into a structured ontology, it becomes possible to perform a gene-gene interaction enrichment study by comparing the INO-based literaturemined data of gene-gene interactions in some specific domain over the data from the broad background.
In this manuscript, we will first introduce the development of INO with a focus on its representation of interaction types and keywords for literature mining. An INO-based gene interaction enrichment method based on a modified Fisher's exact test will then be introduced. We applied our approach to the analysis of the vaccinemediated gene-gene interactions. The resulting overand under-represented gene-gene interaction types and gene-gene interactions will also be described in detail.

INO development
INO was developed by following the Open Biological Ontology (OBO) Foundry ontology development principles, including openness and collaboration [8]. Its development is aligned and integrated with existing OBO Foundry library ontologies. INO imports existing terms by using OntoFox [9]. New terms generated in INO

INO-based literature mining of gene-gene interaction pairs and interaction types
The sentences from the complete PubMed abstracts (published up to 12/31/2013) were obtained from the BioNLP database in the National Center for Integrative Biomedical Informatics (http://ncibi.org/). Our in-house literature mining tools, SciMiner [11] and VO-SciMiner [12], were used to identify gene names/symbols and VO and INO terms (interaction keywords) from these sentences. Sentences with two gene names and at least one INO term (e.g., interacts, binds, activates) were selected.
We obtained the dependency parse trees of the sentences using the Stanford Parser [13] and extracted the shortest dependency path between each pair of genes in a sentence. We defined an edit distance-based kernel function among these dependency paths and used SVM [3] to classify whether a path describes an interaction between a gene pair [6]. A confidence score calculated based on SVM was used to measure the confidence of association between two genes in a sentence in the literature. Positively-scored sentences were kept, and the gene pairs together with the interaction keywords from these sentences were extracted. The extracted interaction keywords were mapped to INO to define the interaction types.

Development of INO-based statistical enrichment analysis of literature mined gene-gene interaction data
A modified Fisher's exact test has gained popularity over the last decade in high-throughput gene expression studies as a preferred method for identifying enriched biological functions among given gene sets [14,15]. We implemented the modified Fisher's exact test in Perl using the Ngram Statistics Package [16] to identify enriched gene-gene interaction types, in terms of INO terms, within a concept-specific sub-network. For each INO term, a 2×2 contingency table is obtained on which the Fisher's test runs, as shown in Table 1. Both significantly under-represented and over-represented terms are selected as a significantly enriched INO term with a p-value < 0.05 after Benjamini-Hochberg (BH) multiple testing corrections. Here a significantly over-represented or under-represented term indicates that the term was significantly more or less frequently used in the vaccine context compared to the whole literature background. In the current study, a vaccine-associated gene-gene interaction network was defined based on the gene-gene interactions obtained from the PubMed abstracts, including those retrieved by a PubMed search of 'vaccine' and those identified by VO-SciMiner using 186 specific vaccine terms extracted from the VO 'vaccine' branch. These 186 vaccine terms (e.g., tuberculosis vaccine BCG) are easily identified by natural language processing programs. This vaccine-associated network was compared against the complete gene-gene interaction network.

The Interaction Network Ontology (INO) (1) INO overall design and hierarchy
INO is a biomedical ontology in the domain of molecular interactions and interaction networks. INO is aligned with the upper-level Basic Formal Ontology (BFO) [17] ( Figure 1). BFO contains two branches, continuant and occurrent. The continuant branch represents timeindependent entities such as material entity, and the occurrent branch represents time-related entities such as process. BFO has currently been used by over 100 domain ontologies, including many (e.g., GO) within the framework of the OBO Foundry [8]. By aligning different domain ontologies under the two branches of BFO, INO is able to efficiently use the terms from other ontologies in representing signaling pathway elements.
Three important INO terms are interaction, network, and pathway. In INO, an interaction is defined as a processual entity that has two or more participants (i.e., interactors) that have an effect upon one another under a particular condition. An interactor (or called interactant) is defined as a material entity that plays the role of "interactor role". With different roles, an interactor can be an 'input interactor', 'output interactor', 'catalyst', 'positive regulator', or 'negative regulator'. An interaction consumes its input interactors (but not the catalysts or regulators) and generates its output interactors. A network is a process that includes at least two connected interactions. A network does not have to include a predefined start or end entity. A pathway is a type of network that has specified distinct start(s) and end(s). Each of these three INO terms includes many subclasses. Therefore, in addition to the representation of various interaction types, INO has also been developed to represent pathways and networks. Furthermore, INO has been used as a species-neutral ontology core and platform for generating human-specific interaction network ontology (HINO) [18,19]. Since the scope of this manuscript is the ontology-based literature mining of gene-gene interactions, we will primarily focus on the ontological representation of interactions in INO.
INO imports terms from other ontologies, particularly from the Proteomics Standard Initiative-Molecular Interaction (PSI-MI), which is a standard molecular interaction data exchange format established by the Human Proteome Organization (HUPO) Proteomics Standard Initiative (http://www.psidev.info). Their PSI-MI format has been widely used in the proteomics community and PSI-MI is also an OBO Foundry library ontology. To be compatible with PSI-MI, we have imported the branch of the 'interaction type' (MI_0190) to INO (Figures 1 and 2).
Compared to PSI-MI, GO Biological Processes (BP) branch often has more detailed subclasses (or subtypes) to specific interaction types. Using more general PSI-MI terms (e.g., PSI-MI 'lipid addition') as parent terms, INO has imported many specific GO subtypes of interactions (e.g., GO 'protein myristoylation') to INO as subclasses of the MI-based interaction terms ( Figure 1). As a specific example, we have imported GO 'protein myristoylation' and all of its GO subclasses to INO ( Figure 2). The GO term 'protein myristoylation' has been used to replace the PSI-MI term 'myristoylation reaction'. It is noted that the top level GO Biological Processes hierarchy is not used because many biological processes (e.g., 'metabolic process') in GO are not 'interaction' per se and thus cannot be imported to INO for interaction representation. While PSI-MI focuses on direct protein-protein interactions, it does not include many other interaction types such as regulation types. Therefore, INO also includes interaction terms that are out of current PSI-MI scope, especially different regulation types (Figure 1). Many of these interaction types were generated by classifying the over 800 interaction keywords used in our previous literature mining studies [1,6].
(2) Literature mining support in INO The over 800 interaction keywords used in our previous literature mining studies [1,6] do not correspond to the same number of interaction types. While an interaction type or term in INO has its ontology ID, such a term may be associated with different synonyms or related keywords that can be used for literature mining. To support identification of genetic interactions in literature, synonyms and related keywords are needed. To meet this need, we have generated an annotation property called 'has literature mining keywords' (Figure 2), which allows the listing of different keywords mapping to the interaction type.
For example, the term 'protein myristoylation' in INO has five related literature mining terms including 'myristoylate', 'myristoylates', 'myristoylated', 'myristoylating', and 'myristoylation'. These term variations are listed as an annotation of the interaction type using the annotation property 'has literature mining keywords' (Figure 2). The list of keywords can be easily extracted from the ontology by SPARQL or other methods and used for literature mining. ontologies, INO also has imported terms from other authoritative domain ontologies such as the Chemical Entities of Biological Interest (ChEBI) [20] and the Ontology of Genes and Genomes (OGG) [21]. Provenance and source ontology IDs are kept in our term importing [9]. The detailed INO term statistics can be found on the Ontobee INO statistics website (http:// www.ontobee.org/ontostat.php?ontology=INO).
Particularly, under the branch of INO interaction, INO includes a total of 355 terms. In addition, approximately 700 keywords are defined using the annotation property 'has literature mining keywords'. These INO interaction terms and their associated literature mining keywords can be used for efficient literature text tagging and retrieval of sentences containing these keywords. The usage of these terms and keywords in our literature mining study is described below.

INO-based literature mining of gene-gene interactions (1) Workflow and system design
The workflow of the ontology-based gene pair enrichment analysis is illustrated in Figure 3. Specifically, all publications from PubMed were first downloaded. The sentences of article titles and abstracts were parsed and pre-processed. Human gene names and interaction keywords were tagged. To tag human gene names, the HUGO human gene nomenclature assignments (http:// www.genenames.org/) were used. These human gene names are also available in the OGG [21]. The INO interaction types and associated keywords were used for tagging interaction keywords. As detailed in the Methods section, an INO-based modified Fisher's exact test was developed to identify statistically significantly enriched gene-gene interaction types and associated gene-gene pairs (Figure 3).  We applied our study to the vaccine domain. A PubMed search for vaccine-related documents resulted in 237,061 hits (as of 12/31/2013). VO-SciMiner additionally identified 28,908 documents using VO terms, resulting in a total of 265,969 documents to define the vaccine-associated document sets. The gene-gene interactions (i.e., gene pairs) with positive SVM scores and at least one INO term at the same sentence level were compiled from these 265,969 PubMed abstracts. A total of 6,116 gene pairs were associated with at least one INO term.
Out of 78 INO interaction terms associated with at least five gene-pairs of the vaccine-associated subnetwork, 14 terms were significantly over-represented (Benjamini-Hochberg (BH) p-value < 0.05 and a minimal enrichment fold of 2) ( Table 2). The results indicate that these 14 interaction types are more extensively studied in the vaccine context among the research of all the gene-gene interaction types published in PubMed.
Furthermore, our gene-gene interaction enrichment analysis was able to retrieve all the gene pairs associated with each interaction type (last column in Table 2). For example, as indicated in five publications (PubMed IDs: 19915058, 8557339, 15557182, 17517055, and 7525727), the cytokines interferon-gamma (IFNG) and interleukin-12A (IL12A) have been found to be closely related, and the neutralization of one cytokine often leads to decreased production of another one [22,23]. Such neutralization-related research is typically found in the field of vaccinology. In another example, associated with the interaction type "induction of production", the production of one cytokine, TNF (or IFNG), was found to be induced by another cytokine, IFNG (or TNF) [24]. A  close examination of all the gene pairs recorded in Table 2 shows that they are all related to the vaccine and immunology research. These results also confirm the specificity of our INO-based enrichment analysis. In addition, our study found 17 significantly underrepresented INO terms with a maximum enrichment fold of 0.5 (equivalent to 2 fold in over-representation) and BH P-value < 0.05 (Table 3). Compared to the general gene-gene interaction research, these interaction types are likely less studied in the vaccinology research field. The reasons of these under-represented interaction types may vary. It is likely that some of these underrepresented interactions represent new research opportunities in the vaccinology domain.
One advantage of INO-based study is that we can rely on the INO hierarchy to identify the relations among enriched interaction types. Such a strategy is used to generate the hierarchies of enriched 14 over-represented and 17 under-represented INO interaction types (Figure 4). This study clearly shows the relations between many different interaction terms. For example, among the three over-represented terms, 'mRNA cleavage', 'RNA cleavage', and 'nucleic acid cleavage', there are two parent-child relations as clearly shown in Figure 4. Interestingly, the term 'cleavage reaction' is one of the 17 under-represented terms (Table 3). It is noted that the more general term 'cleavage reaction' is the parent term of 'nucleic acid cleavage', which is the parent term of 'RNA cleavage' (Figure 4). The term 'RNA cleavage' has a child term 'mRNA cleavage'. Besides these cleavage types, there are many other specific 'cleavage reaction' types, for example, protein cleavage, DNA cleavage, and lipid cleavage. In our calculation of the parent term 'cleavage reaction', we included all its child terms. Therefore, the underrepresented 'cleavage reaction' indicates that the whole category of cleavage reaction is under-represented although the above three specific reaction types are overrepresented.
Both sets of over-represented and under-represented interaction terms share some common top-level terms including 'regulation', 'direct interaction', 'association', and 'interaction'. Otherwise, specific profiles of the two sets are in general distinct at the bottom levels ( Figure 4).

Discussion
This paper introduces two major contributions in the area of ontology-based literature mining research. First, we have for the first time systematically introduced the development of the INO ontology targeting for robust literature mining of gene-gene interaction types. It is noted that in addition to literature mining, INO is also being developed to model various interactions and networks among different molecules [18]. However, the INO development was initiated from meeting our literature mining need [6]. Second, we have proposed and implemented a novel INO-based gene-gene interaction enrichment strategy. The INO-based gene pair enrichment  These literature mining-related keywords are often variations and synonyms of the ontology term labels. The inclusion of these keywords significantly increases our coverage in literature mining. To better understand the interaction term coverage of INO, we have compared the INO system with the commonly used GENIA terminology system [7]. The GENIA term annotation system is grounded on the GENIA ontology that defines biomedically meaningful nominal concepts. Our comparison found that INO covers all 17 interaction types in the GENIA ontology.
To further examine the interaction term coverage of INO, we have also compared our system with the interaction terminology collection from the BioNLP Shared Task 2009, focusing on recognition of bio-molecular events reported in the biomedical literature (http://www. nactem.ac.uk/tsujii/GENIA/SharedTask/). Nine categories of bio-events were covered: gene expression, transcription, protein catabolism, localization, binding, phosphorylation, regulation, positive regulation, and negative regulation [25]. We used the BioNLP'09 Shared Task training data set that consists of 800 abstracts manually labeled for biomolecular events including the event trigger words (i.e., interaction keywords). These abstracts include 994 unique interaction keywords that are shown for 6,607 times in the data set. Our comparative analysis found that INO includes 279 of these 994 unique interaction keywords. These 279 keywords are used for 4,448 times, which corresponds to 67% of coverage if the keyword redundancy is considered. It is noted that many keywords (e.g., by, when, source, products, necessary, through) listed in the BioNLP'09 Shared Task training data are not considered as interaction keywords in INO. We will fully examine all the terms in the BioNLP'09 Shared Task data set and hopefully expand INO to include more interaction keywords.
Our INO-based literature mining study found that while it is relatively easy to describe the relation between two genes when only one interaction keyword exists in the sentence containing these two genes, it is difficult to The results were generated using OntoFox [9] with the OntoFox setting "includeComputedIntermediates", and visualized using the Protege-OWL editor (http://protege.stanford.edu/). The box-enclosed terms are over-or under-represented interaction types directly identified in our program (see Tables 2 and 3). Other terms not enclosed in boxes are terms retrieved by OntoFox to ensure the completeness of the hierarchies. describe the relation between the two genes if multiple keywords exist. For example, in the IFNG-IL12A neutralization-related interaction type (Table 2), we can infer that these two genes participate in a neutralizationrelated interaction(s). However, it does not mean that IFNG neutralizes IL12A, or vice versa. We can only say that these two genes interact somehow in a neutralizationrelated pattern.
It is likely that multiple interaction-related keywords co-exist in one sentence. For example, an IFNG-IL12 neutralization-related sentence is "In vitro IL-12 neutralization dramatically impaired the IFN-gamma response to S. typhimurium but not to ConA" [26]. This sentence contains two interaction-related keywords "neutralization" and "impaired". This is a complex relation where a neutralization of one gene impairs another gene expression. It hints that one gene positively regulates another. In this case, the neutralization is really an experimental condition. Our literature mining program retrieved both keywords independently without considering them together. Specifically, our current method identifies all the interaction keywords and maps each of them to corresponding INO interaction terms. However, we have not systematically modeled and integrated these coexisting terms into better understanding of the patterns of corresponding literature text. It would be more advanced if we could process these two keywords simultaneously and assign a unique interaction type, such as 'impairment after neutralization', which would be a subclass (or child term) of the existing INO term 'positive regulation'. While this example demonstrates a new direction of future research, such analysis does not undermine the contributions of the new INO-based literature mining strategy first reported in this manuscript. Indeed, our strategy provides a new start point and platform for further addressing these challenges.
The analysis of vaccine-associated interaction networks requires intensive research. The research reported here uses INO-based literature mining to analyze the vaccine-relevant gene-gene interactions. More research can be conducted to study vaccine-gene interactions and vaccine-associated adverse events. In addition to the PubMed literature resource used in this study, additional public resources such as Semantic MEDLINE, summarizing PubMed results into an interactive graph of semantic predications [27], and The Vaccine Adverse Event Reporting System (VAERS; https://vaers.hhs.gov), collecting vaccine-associated adverse events following the administrations with various licensed vaccines [28], may further improve the INO-based analysis. While Semantic MEDLINE and VAERS have been used in other vaccinerelated research [29,30], INO-based approaches are expected to advance the research on the interaction networks among vaccines, genes, and adverse events.
The integrative research combining INO and different resources would further facilitate our understanding of vaccine mechanisms and support public health.