Revealing protein functions based on relationships of interacting proteins and GO terms
© The Author(s). 2017
Published: 20 September 2017
In recent years, numerous computational methods have predicted protein function from the protein-protein interaction (PPI) network. These methods assume that two proteins share the same function if they interact with each other. However, recent studies report that the functions of two interacting proteins may be merely related rather than identical, so this assumption can mislead the prediction of protein function. Therefore, the functional relationship between interacting proteins needs to be investigated.
In this paper, the functional relationship between interacting proteins is studied and a novel method, called GoDIN, is proposed to annotate the functions of interacting proteins in the Gene Ontology (GO) context. It is assumed that the functional difference between interacting proteins can be expressed by the semantic difference between a GO term and its relatives. Thus, the method uses a GO term and its relatives to annotate the interacting proteins separately, according to their functional roles in the PPI network. The method is validated by a series of experiments and compared with related methods. The experimental results support the assumption and suggest that GoDIN is effective at predicting protein functions.
This study demonstrates that: (1) interacting proteins are not equal in the PPI network, and their functions may be the same, similar, or merely related; (2) the functional difference between interacting proteins can be measured by their degrees in the PPI network; (3) the functional relationship between interacting proteins can be expressed by the relationship between a GO term and its relatives.
Characterizing protein functions is critical to understanding biological pathways, investigating diseases and developing drugs [1, 2]. To elucidate protein functions, numerous research efforts have been made using techniques ranging from sequence homology detection to text mining of the scientific literature. However, even for well-studied model organisms, only some proteins have been annotated with functional information so far. The situation is even worse for other organisms.
The PPI network is usually treated as undirected. In fact, regulatory and upstream-downstream relationships between interacting proteins are commonplace in the PPI network when the proteins are involved in signal transduction, transcriptional regulation, the cell cycle or metabolism. Moreover, recent studies [21–23] report that guilt by association (GBA) is the exception rather than the rule in the PPI network and that protein functions are determined by specific and critical interactions. Hence, the relationship between interacting proteins may affect their functions and should be considered when predicting protein functions.
In the GO context, a series of standard terms is defined to describe the characteristics of gene products such as proteins, and the terms are arranged in a directed acyclic graph (DAG) hierarchy according to their functional associations. Therefore, functional information is not only expressed by the semantics of the terms but also contained in the hierarchy. Thus, predictions of protein functions may be misled if the functional associations of terms are ignored. In fact, the information carried by the GO hierarchy is crucial for predicting protein functions.
In this paper, we mainly study two problems: (1) how to measure the functional difference between interacting proteins; (2) how to express the functional difference between interacting proteins in the GO context. To solve these problems, we propose a novel method that predicts protein functions by diffusing GO terms in the directed PPI network (GoDIN). Firstly, the relationship between interacting proteins is generalized as functional proactive-reactive. It is assumed that the proactive protein performs fewer and more specific functions than the reactive protein, and a directed PPI network is generated according to the functional proactive-reactive relationships of interacting proteins. Secondly, a coefficient variation is defined to measure the functional difference between interacting proteins. Finally, the functional associations of GO terms are taken into consideration when annotating interacting proteins: an iterative algorithm allocates GO terms to describe protein functions in the PPI network under the control of the coefficient variations. The method is described in the following section.
Functional relationship between interacting proteins
As reported, many proteins play functional roles that differ from those of their neighbors in the PPI network. For example, a protein annotated with the terms “RNA transport” and “RNA binding” may be involved in the translation mechanism and bind to diverse, functionally unrelated proteins. Similarly, the function of proteins that help others fold correctly may be unrelated to that of their partners. These proteins are more likely than others to be hubs in the PPI network. Hubs often have many partners and may be involved in several different biological activities. In general, a protein is multi-functional if it takes part in many different biological activities. As reported, the more multi-functional a protein is, the less specific its function is. Besides, Gillis et al. also found that the multi-functionality of a protein is highly correlated with its degree in the PPI network. Specifically, a protein with a high degree may perform a general function so that it can collaborate with other proteins in diverse biological activities. Low-degree proteins can therefore be considered proactive and high-degree proteins reactive in biological activities. Thus, the relationship between interacting proteins can be generalized as functional proactive-reactive according to their degrees in the PPI network.
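A minimal sketch of how this degree-based generalization could be turned into a directed network is given here. It assumes that, for each interaction, the protein with the strictly lower degree is proactive, the one with the higher degree is reactive, and proteins of equal degree are treated as equal. This tie rule, the function name `orient_by_degree` and the toy proteins P1 to P5 are illustrative assumptions and are not taken from the paper's formula (1).

```python
from collections import defaultdict


def orient_by_degree(edges):
    """Orient undirected PPI edges by node degree (illustrative assumption).

    Returns (degree, arcs), where each arc is (first, second, relation):
    - relation "proactive->reactive": first has the lower degree
    - relation "equal": both proteins have the same degree
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    arcs = []
    for u, v in edges:
        if degree[u] < degree[v]:
            arcs.append((u, v, "proactive->reactive"))
        elif degree[u] > degree[v]:
            arcs.append((v, u, "proactive->reactive"))
        else:
            arcs.append((u, v, "equal"))
    return dict(degree), arcs


# Hypothetical toy network, not the subnetwork used in the paper's example.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5"), ("P2", "P3")]
degree, arcs = orient_by_degree(edges)
for arc in arcs:
    print(arc)
```

In this toy network P1 has the highest degree, so every arc touching P1 points toward it as the reactive protein, while P2 and P3 have the same degree and are treated as equal.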
Measuring functional difference between interacting proteins
Annotating interacting proteins with GO terms based on their functional difference
In traditional methods, the known GO terms of a protein are directly assigned to the interacting partners of the protein. These methods ignore the functional difference between the interacting proteins. In fact, the functions of interacting proteins may be the same, similar, or related but different. Therefore, in our method, the relatives of the known terms of a protein are selected to annotate the interacting partners of the protein.
Generally speaking, this process provides three kinds of predictions: (1) some ancestors of the known terms of the proactive protein may be appropriate to describe the reactive protein; (2) some descendants of the known terms of the reactive protein may annotate the proactive protein; (3) the terms of two interacting proteins can be shared directly if the proteins are equal in the PPI network.
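The three kinds of predictions can be illustrated with a small sketch that collects candidate terms from a toy DAG. The term names, the `parents` mapping and the function `candidate_terms` are hypothetical. The sketch only shows which relatives are eligible in each case, not how GoDIN chooses among them.

```python
def ancestors(term, parents):
    """Return all strict ancestors of `term`; `parents` maps child -> parent list."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen


def descendants(term, parents):
    """Return all strict descendants of `term` by inverting the parent map."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    seen, stack = set(), list(children.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, []))
    return seen


def candidate_terms(known_term, role_of_annotated, parents):
    """Candidate terms for the partner of an annotated protein.

    role_of_annotated is "proactive", "reactive" or "equal" and describes the
    annotated protein relative to the partner that is to be annotated.
    """
    if role_of_annotated == "proactive":   # partner is reactive: more general terms
        return ancestors(known_term, parents)
    if role_of_annotated == "reactive":    # partner is proactive: more specific terms
        return descendants(known_term, parents)
    return {known_term}                    # equal proteins share the term directly


# Hypothetical toy DAG: t_specific -> t_mid -> t_root, t_other -> t_root.
parents = {"t_specific": ["t_mid"], "t_mid": ["t_root"], "t_other": ["t_root"]}
print(candidate_terms("t_mid", "proactive", parents))
print(candidate_terms("t_mid", "reactive", parents))
print(candidate_terms("t_mid", "equal", parents))
```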
Diffusing functional information in the PPI network
Step 1: Select as seed proteins those annotated proteins whose proactive partners have not yet been annotated;
Step 2: Select relatives of the known terms of the seed proteins to describe the functions of their interacting partners according to formulas (4) and (5);
Step 3: Update the terms of the seed proteins based on their annotated reactive partners according to formulas (4) and (5);
Step 4: Remove the seed proteins from the set of annotated proteins and mark the edges incident to them as no longer able to mediate diffusion between interacting proteins; then return to Step 1 until all proteins in the PPI network are annotated or no remaining unannotated protein has an annotated partner.
Time complexity analysis
Given a PPI network of n proteins, the time complexity of determining the functional relationships between proteins is O(n²). Similarly, the time complexity of measuring the functional differences between proteins is also O(n²). If each protein is annotated by at most p GO terms and the maximum degree of the proteins is k, the time complexity of diffusing functional information between two proteins is O(p × k). Accordingly, diffusing functional information through the whole PPI network takes O(m × p × k) if m proteins in the PPI network are annotated. Based on these analyses, the time complexity of GoDIN is O(n²) + O(n²) + O(m × p × k). Because m is at most n and k is at most n − 1, the last term is bounded by O(p × n²), so the time complexity of GoDIN is about O(n²) when p is treated as a constant.
A simple example of GoDIN
In the first iteration, M is regarded as a seed protein, and its neighbors are E, L and J. According to formulas (1) and (2), O(M, E) is 1 and CV(M, E) is 1/6. The known GO term of M is g7, and its semantic value S(g7) is 0.35. Substituting these values into formula (4) gives S(g7*) ≈ 0.408. According to formula (5), g8 is appropriate to annotate protein E. The annotations of L and J are predicted in the same way. Note that, because L was annotated before diffusion, L's term g4 should also be diffused to the seed protein M. Under the True Path Rule (TPR), a protein annotated with a term is also annotated with the ancestors of that term, so g4 already annotates M if g4 is an ancestor of g7. Thus, the annotations of M are not changed by GO term g4. In addition, protein M cannot be selected as a seed protein again, and the arcs M ← J, M ← E and M ↔ L cannot be used to diffuse GO terms again.
In the second iteration, J, E and L are candidates for seed proteins. Because protein L has an annotated proactive partner E, L cannot be taken as a seed protein. Therefore, J and E are selected as seed proteins. According to formula (1), O(J, A) is 0, which means that proteins J and A share the same function. Thus, protein A can be annotated with term g8, which can also be inferred through formulas (4) and (5). Unlike L, A was not annotated before diffusion, so the annotations of J do not need to be inferred from those of A. As for the seed protein E and its partner L, O(E, L) is −1 and CV(E, L) = 1/6 in terms of formulas (1) and (2). Based on these parameters and S(g8), S(g8*) = (1 + 1/6)^(−1) × 0.4 ≈ 0.343. Therefore, term g7 is selected to annotate L in terms of formula (5). After that, proteins J and E cannot be regarded as seed proteins again, and the arcs J ↔ A and E → L cannot be used in the other iterations.
The 3rd, 4th and 5th iterations proceed in the same way as the previous ones. Due to space limitations, their details are not described here. In the 6th iteration, all proteins in the subnetwork have already been annotated and no arc that can mediate diffusion between interacting proteins remains. Thus, the iteration terminates and the diffusion of GO terms through the subnetwork is finished. The inference results are collected and listed in the table.
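The two semantic-value calculations in the first and second iterations can be checked numerically. Both are consistent with a scaling of the form S(g*) = (1 + CV)^O × S(g). This form is inferred from the worked numbers, since formulas (4) and (5) are not reproduced in this section. The nearest-value selection in `pick_closest` is likewise only a hypothetical stand-in for formula (5), and the function names are illustrative.

```python
def target_semantic_value(s_known, o, cv):
    """Scale a known semantic value by (1 + CV)^O (form inferred from the example)."""
    return (1 + cv) ** o * s_known


def pick_closest(target, semantic_values):
    """Hypothetical stand-in for formula (5): term whose value is nearest the target."""
    return min(semantic_values, key=lambda t: abs(semantic_values[t] - target))


# Semantic values quoted in the worked example.
semantic_values = {"g7": 0.35, "g8": 0.40}

# First iteration: seed M (term g7), partner E, O(M, E) = 1, CV(M, E) = 1/6.
s_first = target_semantic_value(semantic_values["g7"], 1, 1 / 6)
term_for_e = pick_closest(s_first, semantic_values)

# Second iteration: seed E (term g8), partner L on the arc E -> L, O = -1, CV = 1/6.
s_second = target_semantic_value(semantic_values["g8"], -1, 1 / 6)
term_for_l = pick_closest(s_second, semantic_values)

print(round(s_first, 3), term_for_e)
print(round(s_second, 3), term_for_l)
```

Under these assumptions the sketch reproduces the values 0.408 and 0.343 reported in the example, and selects g8 for E and g7 for L.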
Experiments and discussions
Basic information of the three PPI networks
Functional relationship between interacting proteins
Functional relationship of interacting proteins
From Table 2, it can be seen that nearly 60% of the interactions in the three networks belong to the first group, about 40% belong to the second group, and less than 1% belong to the third group. As far as we know, no method relying on the PPI network can correctly annotate the interacting proteins in the third group. The traditional methods assume that interacting proteins share the same terms. Thus, about 40% of their functional predictions may be incorrect. Meanwhile, the results suggest that the majority of interacting proteins share the same or similar terms, which is consistent with the basic assumptions of GoDIN.
Functional difference between interacting proteins
Relationship between function and degree of interacting proteins in the same annotation group
Relationship between function and degree of interacting proteins in the similar annotation group
To explain this phenomenon, the coefficient variation is used to measure the functional difference between interacting proteins. The coefficient variations of proteins with different degrees in the same annotation group are compared with those in the similar annotation group. As shown in Fig. 3, box-whisker plots display the distributions of the coefficient variations of the different groups. In the figure, the distributions of the coefficient variations in the same annotation groups are represented by dashed boxes and lines, while those in the similar annotation groups are represented by solid boxes and lines. As usual, the bottom and top of each box are the first and third quartiles of the coefficient variations, the band inside the box is the second quartile (the median), and the hollow spot inside the box is the average. For clarity, the same annotation groups of the three networks, Krogan, DIP and BioGRID, are labeled SameKrogan, SameDIP and SameBIO respectively. Accordingly, the similar annotation groups of those networks are labeled SimilarKrogan, SimilarDIP and SimilarBIO.
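As a small illustration of the summary statistics that each box in such a plot encodes, the following sketch computes the quartiles and the mean for a list of values. The numbers are made up for illustration and are not coefficient variations from the three networks. The quartile convention used by `statistics.quantiles` is its default and may differ from the one used to draw Fig. 3.

```python
import statistics

# Hypothetical coefficient variations for one annotation group (illustrative only).
cvs = [0.0, 0.1, 0.125, 0.2, 0.25, 0.5, 0.6]

# Three cut points dividing the data into four groups: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(cvs, n=4)
mean = statistics.mean(cvs)

print("first quartile (bottom of box):", q1)
print("median (band inside box):", median)
print("third quartile (top of box):", q3)
print("mean (hollow spot):", round(mean, 3))
```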
Comparison with the related methods
As shown in Fig. 5, the precision of GoDIN on Krogan is comparable to that of the best methods, CIA and FCML. Meanwhile, GoDIN shows better precision than the other methods on DIP and BioGRID. FunFlow performs better than the others on DIP but shows lower precision than the other methods on Krogan and BioGRID. In GoDIN, the functional differences between interacting proteins are considered, and the differences between terms are used to express these functional differences when predicting protein function. This is why GoDIN performs better than the others in terms of precision. The functional relationships of terms are also considered thoroughly in CIA and FCML, but these methods pay no attention to the functional differences between interacting proteins. FunFlow ignores the functional relationships of terms when predicting protein function, so it does not perform as well as the others.
In addition, all of the methods show relatively low accuracy. This may be due to two issues: (1) the large number of GO terms; (2) the dependencies among GO terms. The influence of these issues becomes more obvious as proteins are annotated with more terms. This would be a starting point for future study.
As displayed in Fig. 6, FunFlow shows the best recall on almost all of the networks, while GoDIN performs better than FCML and CIA on most of the networks and annotation aspects. The performance of CIA is no better than that of FCML. This may be attributed to the global and local characteristics of the PPI network. Specifically, CIA takes only the local characteristics of the PPI network into consideration when predicting protein functions, while the other methods consider both global and local characteristics. This may be the reason why the recall of CIA is lower than that of the other methods. Besides, some proteins in the datasets are annotated with shallow terms, and misjudgments on these proteins have an obvious negative impact on recall. This would be a starting point for our future study.
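For readers less familiar with the measures compared above, the following sketch computes precision, recall and F-measure for a single protein using the standard set-based definitions. It is an assumption that the evaluation is term-set based in this way, because the exact evaluation protocol is not given in this section. The term sets and the function name are hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Set-based precision, recall and F-measure for one protein's GO terms."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical predicted and known term sets for one protein.
predicted = {"g4", "g7", "g8"}
actual = {"g1", "g2", "g4", "g7"}
precision, recall, f_measure = precision_recall_f(predicted, actual)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

A method that predicts many terms per protein tends to gain recall at the cost of precision, which is one way to read the contrast between the recall and precision comparisons.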
Predicting protein function based on the PPI network has been a hotspot of biological research in recent years. In this paper, the functional relationship between interacting proteins is studied and a novel method of protein function prediction is proposed based on this relationship. To validate the effectiveness of the method, a series of analyses and experiments is performed on three highly reliable networks from different annotation aspects. The results suggest that: (1) interacting proteins are not equal in the PPI network, and their functions may be the same, similar, or merely related; (2) the functional difference between interacting proteins can be measured by their degrees in the PPI network; (3) the functional relationship between interacting proteins can be expressed by the semantic relationship between a GO term and its relatives; (4) compared with the other related methods, GoDIN has high precision and F-measure and is effective at predicting protein function.
We would like to thank the editors and the anonymous reviewers for their comments that led to significant improvements in our manuscript.
Publication of this article was funded by the Fundamental Research Funds for the Central Universities (DB13AB02, DL13AB02) and the Natural Science Foundation of China (61671189, 61271346, 61571163, 61532014 and 91335112).
Availability of data and materials
About this supplement
This article has been published as part of Journal of Biomedical Semantics Volume 8 Supplement 1, 2017: Selected articles from the Biological Ontologies and Knowledge bases workshop. The full contents of the supplement are available online at https://jbiomedsem.biomedcentral.com/articles/supplements/volume-8-supplement-1.
Conceived and designed the approach: ZXT, MZG. Implemented the approach and performed the experiments: ZXT, ZT, KC. Analyzed the results: ZXT, MZG, XYL, ZT. Contributed to the writing of the manuscript: ZXT, MZG, XYL. All the authors have approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Liang C, et al. DisSim: an online system for exploring significant similar diseases and exhibiting potential therapeutic drugs. Sci Rep. 2016;6:30024.
- Peng J, Bai K, Shang X, Wang G, Xue H, Jin S, Cheng L, Wang Y, Chen J. Predicting disease-related genes using integrated biomedical networks. BMC Genomics. 2017;18(1):1043.
- Zeng X, Zhang X, Zou Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Brief Bioinform. 2016;7(2):193–203.
- Quan Z, et al. Similarity computation strategies in the microRNA-disease network: a survey. Brief Funct Genomics. 2016;15(1):55–64.
- Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol. 2007;3:88.
- Schwikowski B, Uetz P, Fields S. A network of protein–protein interactions in yeast. Nat Biotechnol. 2000;18(12):1257–61.
- Nabieva E, et al. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005;21(Suppl 1):i302–10.
- Lee I, Li Z, Marcotte EM. An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae. PLoS One. 2007;2(10):e988.
- Mostafavi S, et al. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9(Suppl 1):S4.
- Kourmpetis YA, van Dijk A, Ter Braak CJ. Gene ontology consistent protein function prediction: the FALCON algorithm applied to six eukaryotic genomes. Algorithms Mol Biol. 2013;8(1):10.
- Kourmpetis YA, et al. Bayesian Markov random field analysis for protein function prediction based on network data. PLoS One. 2010;5(2):e9293.
- Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4(1):2.
- Bader JS, et al. Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol. 2004;22(1):78–85.
- Adamcsek B, et al. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006;22(8):1021–3.
- Ahn YY, Bagrow JP, Lehmann S. Link communities reveal multiscale complexity in networks. Nature. 2010;466(7307):761–4.
- Janga SC, Diaz-Mejia JJ, Moreno-Hagelsieb G. Network-based function prediction and interactomics: the case for metabolic enzymes. Metab Eng. 2011;13(1):1–10.
- Chi X, Hou J. An iterative approach of protein function prediction. BMC Bioinformatics. 2011;12(71):16107–12.
- Wang H, Huang H, Ding C. Function-function correlated multi-label protein function prediction over interaction networks. J Comput Biol. 2013;20(4):322–43.
- Huntley RP, et al. The GOA database: gene ontology annotation updates for 2015. Nucleic Acids Res. 2015;43(D1):D1057–63.
- Liu W, et al. Proteome-wide prediction of signal flow direction in protein interaction networks based on interacting domains. Mol Cell Proteomics. 2009;8(9):2063–70.
- Gillis J, Pavlidis P. "Guilt by association" is the exception rather than the rule in gene networks. PLoS Comput Biol. 2012;8(3):e1002444.
- Gillis J, Pavlidis P. The impact of multifunctional genes on "guilt by association" analysis. PLoS One. 2011;6(2):e17258.
- Cao M, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One. 2013;8(10):e76339.
- Teng Z, et al. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics. 2013;29(11):1424–32.
- Krogan NJ, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006;440(7084):637–43.
- Lukasz S, et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(22):D449–51.
- Andrew CA, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res. 2015;43(1):D637–40.
- Radivojac P, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013;10(3):221–7.