A shortest-path graph kernel for estimating gene product semantic similarity
© Alvarez et al; licensee BioMed Central Ltd. 2011
Received: 27 February 2011
Accepted: 29 July 2011
Published: 29 July 2011
Existing methods for calculating semantic similarity between gene products using the Gene Ontology (GO) often rely on external resources, which are not part of the ontology. Consequently, changes in these external resources, such as biased term distributions caused by shifts in popular research topics, affect the calculation of semantic similarity. One way to avoid this problem is to use semantic methods that are "intrinsic" to the ontology, i.e. independent of external knowledge.
We present a shortest-path graph kernel (spgk) method that relies exclusively on the GO and its structure. In spgk, a gene product is represented by an induced subgraph of the GO, which consists of all the GO terms annotating it. Then a shortest-path graph kernel is used to compute the similarity between two graphs. In a comprehensive evaluation using a benchmark dataset, spgk compares favorably with other methods that depend on external resources. Compared with simUI, a method that is also intrinsic to GO, spgk achieves slightly better results on the benchmark dataset. Statistical tests show that the improvement is significant when the resolution and EC similarity correlation coefficient are used to measure the performance, but is insignificant when the Pfam similarity correlation coefficient is used.
Spgk uses a graph kernel method in polynomial time to exploit the structure of the GO to calculate semantic similarity between gene products. It provides an alternative to both methods that use external resources and "intrinsic" methods with comparable performance.
The Gene Ontology (GO) systematically organizes knowledge by means of well-structured controlled vocabularies and provides consistent descriptions to organisms across species. GO terms have been widely used to annotate genes and gene products in the Gene Ontology Annotation (GOA) project. As the GO becomes more and more important in biomedical research, computational methods are often needed to explore the GO to calculate the semantic similarity between gene products. Such methods have been used in a broad range of applications, including: clustering of genes in pathways [3–6], prediction of protein-protein interactions, and the evaluation of similarity between gene products with respect to expression profiles, protein sequence [9–11], protein function, and protein family.
The semantic similarity between two gene products is usually calculated based on the term similarity. First, pairwise semantic similarities between GO terms that annotate the gene products are calculated. Then, these pairwise similarities are combined to derive an overall semantic similarity between the gene products. Different methods have been used to combine pairwise GO term similarities in previous research [4, 8, 10, 11, 14, 15]. A representative collection of methods for calculating the semantic similarity between GO terms has been reviewed elsewhere. Most of those methods use the information content (IC) of the nearest common ancestor (NCA) or most informative common ancestor (MICA) to quantify the amount of shared information between two GO terms. However, the IC is calculated based on the frequency of GO terms in external resources, such as GOA databases. External resources change as knowledge is updated (e.g., more annotations are included in GOA). Consequently, for the same pair of GO terms, the semantic similarity computed by these methods might change as the external resources evolve, even though the semantic similarity between GO terms should not be affected by such changes. In addition, certain annotations might be frequent simply because of popular research topics, leading to biased results. Some other methods rely on distance measures [17, 18], e.g. counting the number of edges on the shortest path between the involved terms in the GO, to compute the GO term similarity. One shortcoming of this approach is that the edges in the GO do not represent equal semantic distances. Although some methods tried to address this issue by assigning different weights to edges at different levels, they still suffer from the fact that GO terms at the same level do not necessarily have the same specificity. Other methods calculate the semantic similarity between gene products without considering the semantic similarity between GO terms.
In these methods, a gene product is represented by a set or a vector of GO terms that annotate it. Then, the semantic similarity between gene products is calculated as the overlap between sets or the inner product of vectors [4, 10]. However, these methods did not exploit the structure of the GO and ignored the relationship between GO terms.
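The dependence of IC-based term similarity on annotation frequencies, discussed above, can be illustrated with a minimal sketch. The toy ontology, the term names, and the annotation counts are all hypothetical; the Resnik-style measure is taken here as the IC of the most informative common ancestor, and the IC of a term as the negative log of its cumulative annotation frequency.

```python
import math

# Hypothetical toy ontology: child -> set of parents (all names are made up).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(direct_counts):
    """IC(t) = -log p(t); p(t) counts annotations to t or any of its descendants."""
    cumulative = {t: 0 for t in PARENTS}
    for term, count in direct_counts.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {t: -math.log(c / total) for t, c in cumulative.items() if c > 0}

def resnik(t1, t2, ic):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic.get(t, 0.0) for t in common)

# Two hypothetical snapshots of an annotation corpus; the ontology is unchanged.
ic_old = information_content({"dna_binding": 10, "rna_binding": 10, "catalysis": 80})
ic_new = information_content({"dna_binding": 10, "rna_binding": 10, "catalysis": 20})
sim_old = resnik("dna_binding", "rna_binding", ic_old)
sim_new = resnik("dna_binding", "rna_binding", ic_new)
```

In this toy case only the number of annotations to an unrelated term differs between the two snapshots, yet the similarity between the same two terms changes.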
To address the aforementioned issues, we propose a shortest-path graph kernel (spgk) method for calculating the semantic similarity between gene products. In spgk, each gene product is represented as a graph, which is an induced subgraph of the GO. Then a graph kernel method is used to calculate the semantic similarity between the graphs. Spgk is intrinsic to the GO, i.e., it does not rely on external resources to calculate the semantic similarity. Thus, it does not have the same drawbacks as the methods based on the IC of GO terms. At the same time, it uses a graph to explicitly explore the GO structure and exploit the relationship between GO terms. Graph matching is computationally expensive in general, being an NP-complete problem on general graphs. To reduce the computational complexity, we develop a graph kernel to calculate the similarity between graphs. Using a comprehensive evaluation benchmark developed by another group, we compare spgk with other state-of-the-art methods.
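As a rough illustration of the graph representation described above, the following sketch builds the subgraph of a toy ontology induced by a protein's annotation terms. The ontology, the term names, and the `include_ancestors` switch are hypothetical; whether ancestor terms are added to the vertex set is an assumption made here, not something this section specifies.

```python
# Hypothetical toy ontology: child -> list of parents (names are made up).
GO_PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["C"],
}

def induced_subgraph(annotations, parents=GO_PARENTS, include_ancestors=True):
    """Return (vertices, edges) of the subgraph induced by the annotation terms.

    Assumption: with include_ancestors=True the vertex set is extended with
    every ancestor of the annotated terms before the edges are selected.
    """
    vertices = set(annotations)
    if include_ancestors:
        stack = list(annotations)
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in vertices:
                    vertices.add(parent)
                    stack.append(parent)
    # Keep exactly those ontology edges whose two endpoints are both selected.
    edges = {
        (child, parent)
        for child in vertices
        for parent in parents.get(child, ())
        if parent in vertices
    }
    return vertices, edges

vertices, edges = induced_subgraph({"C"})
```

Edges are stored as (child, parent) pairs, so the result remains a directed acyclic graph like the ontology it is taken from.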
In this section, we present a novel method for calculating the semantic similarity between proteins. First, we introduce basic background of the Gene Ontology. Then we describe the details of the graph kernel method.
Gene ontology and gene ontology annotations
The GO project maintains a dynamic, structured, precisely defined, and controlled vocabulary of terms for describing the properties of gene products across species. The GO consists of three different ontologies describing: 1) biological processes (BP), where a process often involves a chemical or physical transformation (e.g. cell growth); 2) molecular functions (MF), where functions are defined as the biochemical activity of gene products (e.g. enzymes); and 3) cellular components (CC), which refers to places in the cell where gene products are active (e.g. nuclear membrane). Each ontology is structured as a directed acyclic graph, where nodes (GO terms) are linked to each other through "is-a", "part-of" or "regulates" relationships. On the other hand, the annotation of gene products is the process of assigning ontology terms to gene products in order to describe their activities and localization. For example, the GOA project, at the European Bioinformatics Institute (EBI), aims to provide high-quality electronic and manual annotations to UniProt KnowledgeBase (UniProtKB) entries. GOA annotations are obtained from strictly controlled methods, where every association is supported by a distinct evidence source. A protein can be annotated with multiple GO terms from any of the three ontologies in the GO. Functional annotations of UniProtKB proteins currently consist of over 32 million annotations, which cover more than 4 million proteins.
Graph representation of proteins
A shortest-path graph kernel for proteins
A previous study by Xu et al. shows that having more annotations per protein in the dataset leads to more reliable functional similarity estimation from the GO. Thus, for the purpose of evaluation, we selected from GOA the 100 proteins with the highest numbers of annotations. We also ensured that every selected protein: 1) existed in the UniProtKB/Swiss-Prot database, 2) had at least one annotation from each of the three ontologies in GOA-Uniprot, and 3) had at least one Pfam-A annotation. The evaluation proceeded as follows: First, the graph kernel was used to calculate pairwise semantic similarities for the set of proteins. Second, pairwise functional similarities between the proteins were calculated based on the Pfam database annotations. Last, the Pearson's Correlation Coefficient between the semantic and functional similarities was calculated. If two proteins have similar function, then a good semantic similarity method should detect high semantic similarity between them. Thus, higher values of Pearson's Correlation Coefficient indicate better performance in the calculation of the semantic similarity. This procedure was repeated for each of the three ontologies in the GO, namely, BP, MF, and CC.
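The three evaluation steps above can be sketched as follows. The protein identifiers, the Pfam family names, and the semantic similarity values are hypothetical, and the Jaccard overlap of Pfam family sets is only an assumed stand-in, since this section does not state how the Pfam-based functional similarity is computed.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def evaluate(proteins, semantic_sim, functional_sim):
    """Correlate semantic and functional similarity over all protein pairs."""
    pairs = list(combinations(sorted(proteins), 2))
    semantic = [semantic_sim(p, q) for p, q in pairs]
    functional = [functional_sim(p, q) for p, q in pairs]
    return pearson(semantic, functional)

# Hypothetical Pfam-A family assignments for three made-up proteins.
PFAM = {"P1": {"PF_a", "PF_b"}, "P2": {"PF_a"}, "P3": {"PF_c"}}

def pfam_jaccard(p, q):
    """Assumed functional similarity: overlap of Pfam family sets."""
    return len(PFAM[p] & PFAM[q]) / len(PFAM[p] | PFAM[q])

# Hypothetical semantic similarities, standing in for the graph kernel output.
SEMANTIC = {("P1", "P2"): 0.8, ("P1", "P3"): 0.1, ("P2", "P3"): 0.2}

score = evaluate(PFAM, lambda p, q: SEMANTIC[(p, q)], pfam_jaccard)
```

The same `evaluate` call would be repeated once per ontology (BP, MF, and CC), each time with the semantic similarity computed from that ontology's annotations.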
Results and discussion
In our experiments, we used revision 1.723 of the GO and release 74.0 of GOA-Uniprot, where GO terms are assigned to proteins in UniProtKB by manual and electronic methods. As mentioned before, the GO contains three different ontologies that describe gene products in terms of their associated biological processes, molecular functions, and cellular components.
Performance of spgk
Table: Performance of spgk, measured by Pearson's Correlation Coefficient.
Comparison of spgk with state-of-the-art methods
To compare spgk with other existing methods, we used the Collaborative Evaluation of GO-Based Semantic Similarity Measures (CESSM) online tool. This tool has been made available by the XLDB research group at the University of Lisbon. For the purpose of comparisons, CESSM provides a standard dataset consisting of 13,340 pairs of proteins involving 1,039 distinct proteins and implements 11 state-of-the-art semantic similarity methods, namely, simGIC and simUI, and three versions (the average, maximum and best-match average) of three different term similarity methods, namely Resnik, Lin, and Jiang & Conrath. As a result, users can compare their methods with the 11 methods using the standard dataset.
As pointed out by Pesquita et al. in a comprehensive evaluation, the maximum and average versions of term similarity methods have limitations from a biological point of view. Comparisons using the standard datasets at CESSM also confirmed that the best-match average version has better performance than the maximum and average versions for the Resnik, Lin and Jiang & Conrath methods. Thus, in this section, we will compare spgk with simGIC, simUI, and the best-match average version of the Resnik, Lin and Jiang & Conrath methods using CESSM. CESSM provides three different ways for evaluating a semantic similarity method, i.e., comparing the resulting semantic similarities with (1) functional similarities measured as sequence similarities, (2) functional similarities derived from enzyme commission (EC) classification, and (3) functional similarities derived from Pfam annotations.
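The three ways of combining pairwise term similarities mentioned above can be sketched as follows. The term names and the toy term similarity function are hypothetical, and the best-match average is taken here as the mean of the two directional averages of best matches, which is one common formulation and may differ in detail from the one used by CESSM.

```python
def combine(terms_a, terms_b, term_sim):
    """Combine pairwise term similarities into three protein-level scores."""
    matrix = [[term_sim(a, b) for b in terms_b] for a in terms_a]
    flat = [value for row in matrix for value in row]
    # Best match of each term of A in B, and of each term of B in A.
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return {
        "average": sum(flat) / len(flat),
        "maximum": max(flat),
        "best_match_average": 0.5 * (
            sum(row_best) / len(row_best) + sum(col_best) / len(col_best)
        ),
    }

def toy_term_sim(a, b):
    """Hypothetical term similarity: 1 for identical terms, 0 otherwise."""
    return 1.0 if a == b else 0.0

# A protein annotated with two unrelated terms, compared with itself.
self_scores = combine(["t1", "t2"], ["t1", "t2"], toy_term_sim)
# Two proteins that share exactly one of their terms.
pair_scores = combine(["t1", "t2"], ["t1", "t3"], toy_term_sim)
```

In this toy setting the average version scores a protein against itself at 0.5, and the maximum version gives the partially overlapping pair the same score as the identical pair, whereas the best-match average separates the two cases.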
The spgk method achieves the best results in tables 2 and 3, and is the second best in table 4. In addition to the better performance, the key advantage of spgk is that it is intrinsic to the ontology, i.e., it does not rely on external resources in the calculation of the semantic similarity. In contrast, all the other methods (except simUI) shown in tables 2, 3 and 4 rely on external resources, i.e., the annotations in GOA. Although general graph comparison is computationally expensive, spgk does not suffer from this drawback. Using the shortest-path graph kernel, spgk runs in polynomial time ($O(n^4)$), where $n$ is the number of vertices. In addition, each step of the graph kernel is simple to compute. For example, $k_{node}$ only needs to check whether two vertex IDs are identical, and $k_{edge}$ considers the length difference between two edges. Thus, the constant factors associated with the polynomial time complexity are very small and spgk can run very fast in real applications.
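A shortest-path graph kernel of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: graphs are treated as unweighted and undirected, the bounded form of `k_edge` and its constant `c` are assumed, and no normalization is applied. All graph and vertex names are hypothetical.

```python
from itertools import combinations

def shortest_path_lengths(vertices, edges):
    """Floyd-Warshall on an unweighted, undirected graph.

    Returns {(u, v): length} for every connected pair with u < v.
    """
    inf = float("inf")
    verts = sorted(vertices)
    dist = {u: {v: (0 if u == v else inf) for v in verts} for u in verts}
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in verts:
        for i in verts:
            for j in verts:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return {
        (u, v): dist[u][v]
        for u, v in combinations(verts, 2)
        if dist[u][v] < inf
    }

def k_node(u, v):
    """Vertices match only if their IDs are identical."""
    return 1.0 if u == v else 0.0

def k_edge(len1, len2, c=2.0):
    """Assumed form: decreases with the length difference, bounded below by 0."""
    return max(0.0, c - abs(len1 - len2))

def spgk(graph1, graph2):
    """Sum the products of node and edge kernels over all pairs of shortest paths."""
    sp1 = shortest_path_lengths(*graph1)
    sp2 = shortest_path_lengths(*graph2)
    total = 0.0
    for (u1, v1), len1 in sp1.items():      # up to n^2 shortest paths
        for (u2, v2), len2 in sp2.items():  # times up to n^2 shortest paths
            total += k_node(u1, u2) * k_edge(len1, len2) * k_node(v1, v2)
    return total

# Two hypothetical graphs over the same three vertex IDs.
g1 = ({"A", "B", "C"}, {("A", "B"), ("B", "C")})
g2 = ({"A", "B", "C"}, {("A", "B"), ("A", "C")})
value_12 = spgk(g1, g2)
value_11 = spgk(g1, g1)
```

The double loop over shortest-path pairs is where the $O(n^4)$ bound comes from: each graph has at most on the order of $n^2$ shortest paths, and the preceding Floyd-Warshall step costs only $O(n^3)$.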
SimUI is also intrinsic to the ontology. In simUI, the semantic similarity between two proteins is defined as the ratio of the number of GO terms shared by the two proteins to the number of GO terms in their union. Thus, simUI requires only linear time ($O(n)$) and has the advantage of being simpler and faster to compute. However, tables 2, 3 and 4 show that spgk slightly outperformed simUI in all cases. We estimated the statistical significance of the improvement of spgk over simUI using Fisher's transformation. The p values were less than 0.001 when resolution was used to measure performance (table 2), 0.0384 for the EC similarity correlation coefficient (table 3) and 0.2266 for the Pfam similarity correlation coefficient (table 4). Therefore, at the conventional threshold of 0.05, the improvement is significant when the performance is measured by resolution and EC similarity correlation coefficient, but is insignificant when measured by Pfam similarity correlation coefficient. Comparing tables 2, 3 and 4, we can see that the performance in table 4 is the poorest for all the methods. That might partially explain why the improvement is insignificant when the Pfam similarity correlation coefficient is used as the measurement (table 4).
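The simUI definition given above amounts to a set overlap ratio, as in the following minimal sketch. The term identifiers are hypothetical, and the term sets are taken as given, without any assumption about how they were assembled.

```python
def sim_ui(terms_a, terms_b):
    """Number of shared GO terms divided by the number of terms in the union."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty annotation sets
    return len(a & b) / len(union)

# Two hypothetical proteins sharing two of four distinct terms.
example = sim_ui({"t1", "t2", "t3"}, {"t2", "t3", "t4"})
```

With hash-based sets the intersection and union take time linear in the number of terms, which matches the linear cost stated above.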
In this manuscript, we have presented a method (spgk) that computes the semantic similarity between gene products using only information intrinsic to GO. In comprehensive evaluations using a benchmark dataset, spgk compares favorably with other state-of-the-art methods that depend on external resources. Compared to simUI, spgk achieves slightly better results but also has a higher time complexity. A major difference between spgk and simUI is that spgk takes into account the structure of the ontology. Since the structure of the ontology contains important information, it is important to exploit it to capture semantic similarity. The results presented here show that spgk provides an alternative to both methods that rely on external resources and "intrinsic" methods with comparable performance.
Spgk in its current form still has some limitations that future development could address. For example, in spgk, the function ($k_{node}$) that compares nodes only considers whether the two nodes are identical. However, each node in the GO is associated with a text definition, which contains rich information that is useful for deriving biological relationships between nodes. Thus, one direction for future improvement is to take into account the semantics of the text definition when comparing nodes. Furthermore, the $k_{edge}$ function only considers the length difference between two paths. In GO, the edges are associated with different types of relationship. Since different types of relationship have different biological meanings, they should be given different weights. Thus, another direction for improvement is to systematically explore weighting methods that assign different weights to the edges based on the biological relationships.
We would like to thank the XLDB Research Team from the University of Lisbon for providing an online tool for the evaluation of GO-based semantic similarity measures. In particular, we thank Catia Pesquita for all the kind support given for using their tool. This project was partially supported by NIH Grant Number P20 RR016471 from the INBRE Program of the National Center for Research Resources.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Barrell D, Dimmer E, Huntley RP, Binns D, O'Donovan C, Apweiler R: The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucl Acids Res. 2009, 37: D396-403. 10.1093/nar/gkn803.
- Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A new method to measure the semantic similarity of go terms. Bioinformatics. 2007, 23: 1274-1281. 10.1093/bioinformatics/btm087.
- Sheehan B, Quigley A, Gaudin B, Dobson S: A relation based measure of semantic similarity for gene ontology annotations. BMC Bioinformatics. 2008, 9: 468. 10.1186/1471-2105-9-468.
- Nagar A, Al-Mubaid H: A new path length measure based on go for gene similarity with evaluation using sgd pathways. Proceedings of IEEE International Symposium on Computer-Based Medical Systems. 2008, 590-595.
- Du Z, Li L, Chen C-F, Yu PS, Wang JZ: G-sesame: web tools for go-term-based gene similarity analysis and knowledge discovery. Nucl Acids Res. 2009, 37: W345-349. 10.1093/nar/gkp463.
- Xu T, Du L, Zhou Y: Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinformatics. 2008, 9: 472. 10.1186/1471-2105-9-472.
- Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A: Correlation between gene expression and go semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2005, 2: 330-338. 10.1109/TCBB.2005.50.
- Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM: Metrics for go based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 2008, 9: 5. 10.1186/1471-2105-9-5.
- Mistry M, Pavlidis P: Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics. 2008, 9: 327. 10.1186/1471-2105-9-327.
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics. 2003, 19: 1275-1283. 10.1093/bioinformatics/btg153.
- Fontana P, Cestaro A, Velasco R, Formentin E, Toppo S: Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology. PLoS ONE. 2009, 4: e4619. 10.1371/journal.pone.0004619.
- Couto FM, Silva MJ, Coutinho PM: Measuring semantic similarity between gene ontology terms. Data and Knowledge Engineering. 2007, 16: 137-152.
- Schlicker A, Domingues F, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 2006, 7: 302. 10.1186/1471-2105-7-302.
- Alvarez M, Qi X, Yan C: GO-Based Term Semantic Similarity. Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances. Edited by: Wong W, Liu W, Bennamoun M. 2011, Pennsylvania: IGI-Global, 174-185.
- Pesquita C, Faria D, Falcão AO, Lord P, Couto FM: Semantic similarity in biomedical ontologies. PLOS Computational Biology. 2009, 5: e1000443. 10.1371/journal.pcbi.1000443.
- Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose MA: A knowledge-based clustering algorithm driven by gene ontology. J Biopharm Stat. 2004, 14: 687-700. 10.1081/BIP-200025659.
- Wu X, Zhu L, Guo J, Zhang D-Y, Lin K: Prediction of yeast protein-protein interaction network: insights from the gene ontology and annotations. Nucl Acids Res. 2006, 34: 2137-2150. 10.1093/nar/gkl219.
- The UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucl Acids Res. 2010, 38: D142-148.
- Borgwardt KM, Ong CS, Schonauer S, Vishwanathan SVN, Smola AJ, Kriegel H-P: Protein function prediction via graph kernels. Bioinformatics. 2005, 21: i47-56. 10.1093/bioinformatics/bti1007.
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, H-R Hotz, Ceric G, Forslund K, Eddy SR, Sonnhammer ELL, Bateman A: The pfam protein families database. Nucl Acids Res. 2008, 36: D281-288. 10.1093/nar/gkn226.
- Pesquita C, Pessoa D, Faria D, Couto F: CESSM: Collaborative Evaluation of Semantic Similarity Measures. Proceedings of JB2009: Challenges in Bioinformatics Lisbon, Portugal. 2009.
- Resnik P: Using information content to evaluate semantic similarity in a taxonomy. Proceedings of International Joint Conference on Artificial Intelligence. 1995, 448-453.
- Lin D: An information-theoretic definition of similarity. Proceedings of International Conference on Machine Learning. 1998, 296-304.
- Jiang JJ, Conrath DW: Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of International Conference Research on Computational Linguistics. 1997, 19-33.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.