Constructing an integrated gene similarity network for the identification of disease genes

Tian, Zhen; Guo, Maozu; Wang, Chunyu; Xing, LinLin; Wang, Lei; Zhang, Yin

doi:10.1186/s13326-017-0141-1

Research
Open access
Published: 20 September 2017

Constructing an integrated gene similarity network for the identification of disease genes

Zhen Tian¹,
Maozu Guo¹,
Chunyu Wang¹,
LinLin Xing¹,
Lei Wang² &
…
Yin Zhang²

Journal of Biomedical Semantics volume 8, Article number: 32 (2017) Cite this article

5083 Accesses
22 Citations
1 Altmetric
Metrics details

Abstract

Background

Discovering novel genes that are involved human diseases is a challenging task in biomedical research. In recent years, several computational approaches have been proposed to prioritize candidate disease genes. Most of these methods are mainly based on protein-protein interaction (PPI) networks. However, since these PPI networks contain false positives and only cover less half of known human genes, their reliability and coverage are very low. Therefore, it is highly necessary to fuse multiple genomic data to construct a credible gene similarity network and then infer disease genes on the whole genomic scale.

Results

We proposed a novel method, named RWRB, to infer causal genes of interested diseases. First, we construct five individual gene (protein) similarity networks based on multiple genomic data of human genes. Then, an integrated gene similarity network (IGSN) is reconstructed based on similarity network fusion (SNF) method. Finally, we employee the random walk with restart algorithm on the phenotype-gene bilayer network, which combines phenotype similarity network, IGSN as well as phenotype-gene association network, to prioritize candidate disease genes. We investigate the effectiveness of RWRB through leave-one-out cross-validation methods in inferring phenotype-gene relationships. Results show that RWRB is more accurate than state-of-the-art methods on most evaluation metrics. Further analysis shows that the success of RWRB is benefited from IGSN which has a wider coverage and higher reliability comparing with current PPI networks. Moreover, we conduct a comprehensive case study for Alzheimer’s disease and predict some novel disease genes that supported by literature.

Conclusions

RWRB is an effective and reliable algorithm in prioritizing candidate disease genes on the genomic scale. Software and supplementary information are available at http://nclab.hit.edu.cn/~tianzhen/RWRB/.

Background

Prioritization of candidate disease genes is a fundamental challenge in human health with applications to understand disease mechanisms, diagnosis and therapy [1,2,3,4,5]. Many human diseases are complex and polygenic, involving linking genomic variation to clinical phenotype. Traditional linkage analyses and association study have conducted susceptible genomic interval in the chromosomes [6,7,8]. However, since the susceptible locus may contain several hundreds of genes, computational approaches are widely accepted to further infer causal genes that are associated with interested diseases [9,10,11].

Given a disease and its disease genes, the target of prioritization is usually to measure the similarity between candidate genes and the disease genes [1, 12, 13]. It is generally believed that it is the abnormal expression of disease genes that lead to the diseases happen. The disease genes are also called causal genes or disease related genes for the diseases sometimes. Many methods which take the “guilt by association” principle have been proposed to prioritize candidate genes based on a comprehensive range of biological information [10, 14,15,16,17,18,19,20,21]. They are devoted to fully characterize genes (or corresponding gene products), to measure the similarity between known disease genes and candidate genes more precisely and reliably. These methods are usually called feature-based methods [22]. The metric of similarity is generally based on sequence-based features of genes [23,24,25], functional annotation of genes [13, 26, 27] and protein-protein interaction data [28, 29]. The ultimate goal is to discriminate disease genes and non-disease genes based on certain characteristics of genes [30, 31].

More recently, many methods [32,33,34,35,36,37,38] make use of phenotype similarity between diseases to prioritize candidate disease genes [39, 40]. This is because phenotypic similarity of diseases can help increase the total number of known disease genes for less studied disease phenotypes [41]. The underlying assumption for these methods is that similar phenotypes are caused by functionally related genes [12, 42]. These methods are usually called similarity-based methods [22]. Lage [2] built a Bayesian model based on PPI network and phenotype similarity network, and then prioritized the candidate genes with the help of candidate protein complex. Kohler [32] first grouped diseases into families and then employed a random walk from known disease genes in its family to prioritize candidate genes. Later, Wu [33] put forward a regression model, named CIPHER, to exploit phenotype-gene associations. More recently, Li [35] first constructed a heterogeneous network by making the best use of the phenotype similarity network and gene network as well as the phenotype-gene relationship information. Then they employed the random walk model, called RWRH, to infer disease genes.

Most methods for prioritizing candidate disease genes above mainly rely on PPI networks. However, current PPI networks mainly have two shortcomings. One is that the coverage of the available PPI networks is typically low [29, 43, 44]. Since the curated physical interactions are generally preferred, they often lead to insufficient coverage in human genome [45]. This may result in a serious problem that some known disease genes cannot be mapped into the PPI networks. To address this issue, several researchers [6, 46,47,48] have attempted to construct gene semantic similarity network. For instance, Li [6] employ a random walk with restart algorithm on the multigraphs, which merges various genomic networks to enlarge the range of candidate genes and increase the noise tolerance of networks. However, these different genomic networks do not integrate indeed. The weights assigned to different networks are also difficult to confirm.

The other is the low reliability of PPI networks [49]. Since a single data source is prone of bias and incompleteness, integration of various genomic data sources is highly demanded for the study of disease gene prioritization [6, 10, 50, 51]. Although multiple data sources are available, most methods only access one or two of these databases, which all have their limitations. Chen [52] proposed a method, called BRIDGE, which utilize a multiple regression model with lasso penalty to prioritize the candidate genes by integrating disease phenotype similarity. Zhang [53] adopted a Bayesian regression approach to integrate multiple PPI networks. The approach takes the strength of association between a query disease and a candidate gene as a score to prioritize candidate genes. However, to the best of our knowledge, constructing and integrating multiple gene similarity networks for prioritizing disease genes has not been investigated well. As a result, there is still a need for the improvement in these disease gene prioritization methods.

Motivated by the observations above, we proposed the random walk with restart on phenotype-gene bilayer network (RWRB) algorithm to prioritize candidate genes of diseases. We firstly construct five individual gene similarity networks based on genomic data of genes. Then we obtain an integrated gene similarity network (IGSN) via the similarity network fusion (SNF) method. After that, combining the phenotype similarity network, phenotype-gene association network and IGSN, a phenotype-gene bilayer network is constructed. In the end, we employ the RWRB algorithm on the phenotype-gene bilayer network and prioritize candidate disease genes on the whole genomic scale. On the benchmark datasets, RWRB performs better than other leading approaches. The framework of our proposed method is shown in Fig. 1. It is noteworthy that, to take advantage of more abundant genome data related to genes, we treat sequence and domain similarity between proteins as the similarity between their corresponding protein-coding genes. Therefore, the similarity between genes or proteins is collectively called gene similarity to simplify in this article.

Methods

Datasets

Phenotype similarity network

In OMIM database, a phenotype is defined as a MIM record. The similarity between phenotypes has been calculated by text mining of MIM records [54]. We downloaded the phenotype similarity network [39], which contains pairwise similarity scores for 5080 phenotypes, covering the majority of recorded human phenotypes in this database.

Phenotype-gene association network

The phenotype-gene relationship data is downloaded from the OMIM database (http://omim.org/). After filter out phenotypes which do not belong to the phenotype similarity network above and have no known disease genes, we collect 2133 phenotypes and 1893 disease genes involving 2386 phenotype-gene associations totally.

Gene data

Gene Ontology (GO) and Gene Ontology Annotation (GOA) data of human is download from the GO website (http://geneontology.org, dated November 2, 2015). The numbers of annotated genes in cellular component (CC), molecular function (MF) and biological process (BP) ontologies are 16,938, 18,225, and 17,072, respectively. Here, we consider all types of annotations which contains Inferred from Electronic Annotations. Amino acid sequences of proteins are obtained from the UniProt database [55]. The number of protein sequence in human database is 18,830. Domains of proteins are downloaded from PFAM database (http://www.sanger.ac.uk/Software/Pfam) [56]. Here, we only collected Pfam-A, a collection of manually curated and functionally assigned domains, instead of Pfam-B, which is computationally derived collection of domains, to ensure accuracy in measuring the similarity between proteins. The number of human proteins annotated by Pfam-A is 18,523 involving 5333 kinds of domains in this database.

Construction of gene similarity networks based on genomic data of genes

Constructing gene functional similarity networks based on gene ontology

GO is a standardized and controlled vocabulary to describe genes and gene product attributes. It comprises three orthogonal ontologies: CC, MF and BP, respectively. In our research, CC, MF and BP ontology has 3817, 9943 and 27,864 terms, respectively.

Functional similarity between genes can be inferred from the semantic relationships of their GO terms [51, 57]. In this work, the functional similarity between two genes is measured by Wang method [58] taking BMA strategy because of its an outstanding performance. For the sake of three ontologies are independent, the functional similarity between genes can be measured from three different ontologies. Therefore, we obtain network on CC, MF and BP ontology, respectively.

Constructing protein similarity network based on protein sequence

We used bitscores calculated by the Basic Local Alignment Search Tool (BLAST) to create our sequence homology dataset. First of all, we performed an all-versus-all comparison between proteins with an expectation-value threshold of 10⁻⁶. Then, the similarity between proteins was normalized according to their corresponding bitscores of proteins. Then, applying this operation to all protein pairs, we got the similarity network of protein sequences.

Constructing protein similarity network based on protein domains

We calculated the Jaccard scores [59] between protein domain set as domain similarity of proteins. The Jaccard score between proteins p ₁ and p ₂ is defined as $ {D}_{p_1}\cap {D}_{p_2}/{D}_{p_1}\cup {D}_{p_2} $, which is the ratio of the number of common domains between p ₁ and p ₂ over the total number of domains in p ₁ and p ₂. D _p denotes the domain set of proteinp. There are totally 18,526 proteins involving 5333 kinds of domain used in our analysis. Applying this operation to all protein pairs, thus we constructed a domain similarity network.

The overlap among the five aspects of annotation information about genes (proteins) above is unexpectedly large, as shown in Fig. 2. Numbers in the figure denote the number of genes that annotated by the corresponding information in each part, where CC, MF and BP denote corresponding annotations of genes. Seq and Domain denote amino acid sequences and domain of proteins.

Integrating gene similarity networks based on SNF method

We have constructed five gene similarity networks based on BP, CC, MF, sequence and domain information of genes. In this subsection, we will employ SNF method [60] to integrate these five networks.

Suppose W ^(m) (here m = 1,2,3,4,5) denotes one of the adjacent matrices of gene similarity networks, we use Eq. (1) to compute the normalized weighted matrix of W ^(m), which can be defined as:

$$ {P}_{ij}^{(m)}=\left\{\begin{array}{l}\frac{W_{ij}^{(m)}}{2{\sum}_{k\ne i}{W}_{ik}^{(m)}}\kern0.5em if\ j\ne i\\ {}\kern2em \frac{1}{2}\kern1em if\ j=i\ \end{array}\right. $$

(1)

The normalization used here is free of the scale of self-similarity in the diagonal entries. It can avoid numerical instabilities and ∑_j P(i, j) = 1 still holds.

At the same time, we define the local kernel matrix $ {S}_{i,j}^{(m)} $, which is calculated by Eq. (2)

$$ {S}_{ij}^{(m)}=\Big\{{\displaystyle \begin{array}{l}\frac{W_{ij}^{(m)}}{\sum_{k\in {V}_i^{(m)}}{W}_{ik}^{(m)}}\kern0.5em ifj\in {V}_i^{(m)},\\ {}\kern2em 0\kern2.5em \mathrm{otherwise}\end{array}}\operatorname{} $$

(2)

where $ {V}_i^{(m)} $ denotes a set which contains K nearest neighbors of gene i in the matrix W ^(m). Since local similarities (high values) are more reliable than remote ones, we filter out the low similarity neighbors and set these similarities to zero. The K most similar genes for each gene in the networks are preserved. The local neighborhoods are further exploited to measure the local affinities among genes [61]. Therefore, S ^(m) keeps the local structure of W ^(m).

In summary, P ^(m) carries the full information about the similarity of each gene to all others, whereas S ^(m) only encodes the similarity to the K most similar genes. Here, P ^(m) and S ^(m) are called status matrices and kernel matrix [60], respectively.

To fuse the similarity networks, SNF takes the interactive process of the following update equation:

$$ {P}_{t+1}^{(m)}={S}^{(m)}\times \left(\frac{1}{M-1}\sum_{n\ne m}{P}_t^{(n)}\right)\times {\left({S}^{(m)}\right)}^T $$

(3)

where m is the index of corresponding adjacent matrices of similarity networks, and t is the iteration number. It should be noted that we perform normalization on $ {P}_{t+1}^{(m)} $ as in Eq. (1) after each iteration. Another way to think of the updating rule (3) is

$$ {P}_{t+1}^{(m)}\left(i,j\right)=\sum_{h\in {V}_i^{(m)}}\sum_{l\in {V}_j^{(m)}}{S}_{i,h}^{(m)}\times {\left(\frac{1}{M-1}\sum_{n\ne m}{P}_t^{(n)}\right)}_{h,l}\times {S}_{j,l}^{(m)} $$

(4)

Because the similarity information is only propagated through the common neighborhood between genes, SNF is robust to noise existing in genome data. Besides, if two genes g _i and g _j have common neighbors in all of similarity matrices, it should be well believed that they have the high similarity. What’s more, SNF benefits the fact that even if g _i and g _j are not very similar in one data type, their similarity can be measured in another data type and this similarity information can be propagated through the fusion process [60, 62]. The illustrative example for fusing two networks based on SNF is shown in Fig. 3.

Finally, after t steps of iteration, these five matrices will converge to a single integrated matrix, which can be computed as:

$$ P=\frac{1}{M}\sum_{m=1}^M{P}_t^{(m)} $$

(5)

We obtain the primary integrated gene similar network in this step.

Clustering coefficient-based threshold selection

The five gene similarity networks are fused as a primary integrated gene similar network, whose nodes represent the genes and edges represent the similarity between genes. However, there is still a serious problem needing to be addressed that how similar between two genes can be connected in the network. Because most molecular networks follow a power law or lognormal distribution [12], we should set an appropriate threshold to ensure that the primary integrated gene similarity network meets this demand. The similarity between genes which is greater than the proper threshold will be connected by edges. Otherwise, the similarity will be set to zero [46]. In this research, we adopt the clustering-coefficient-based threshold selection method to select a proper threshold for the primary integrated gene similarity network.

The clustering coefficient of a gene i in the network is defined as:

$$ {C}_i=\frac{2{E}_i}{k_i\left({k}_i-1\right)} $$

(6)

where E _i represents the number of edges between the k _i (>1) first neighbors of gene i. The clustering coefficient of a network is defined as the average clustering coefficient of its all nodes.

$$ C=\frac{1}{K}\sum_{k_i>1}{C}_i $$

(7)

where K denotes the total number of nodes in the network.

The threshold selection for a network can be regarded as a process, where edges are removed from the initially complete graph by gradually increasing the similarity threshold between genes. For each threshold r, we can construct a network by the means of filtering out the similarity lower than the threshold r. It is generally believe that the clustering coefficient of molecular networks, denoted byC(r), should be significantly higher than the that of the corresponding random network, which is denoted by C ₀(r).

Therefore, we formulate a discrete optimization problem, in which the cutoff threshold should meet the demand

$$ {C}^{\ast }=\underset{j}{\min}\left\{{r}_j:C\left({r}_j\right)-{C}_0\left({r}_j\right)>C\left({r}_{j+1}\right)-{C}_0\left({r}_{j+1}\right)\right\} $$

(8)

over a set of thresholds 0 = r ₀ < r ₁ < ⋯ < r _j − 1 < r _J = 1. In Eq. (8), r _j + 1 = r _j + 0.001; C(r)and C ₀(r) denote the clustering coefficients of the gene similarity network and the corresponding random network at the threshold r, respectively. The aim of this procedure is to find the first local maximum, which means the first stop of monotonically increasing of C(r _j) − C ₀(r _j).

On the other hand, the clustering coefficient of a corresponding random network is determined by

$$ {C}_0=\frac{{\left({\overline{k}}^2-\overline{k}\right)}^2}{{\overline{k}}^3N} $$

(9)

where N is the total number of nodes in a network, $ \overline{k}=1/N{\sum}_{i=1}^N{k}_i $ $ \overline{k^2}=1/N{\sum}_{i=1}^N{k}_i^2 $.

Finally, after threshold selection for the primary integrated gene similarity network, the IGSN that we need is constructed. It is represented as G(V, E, t), where V = { g ₁, g ₂, ⋯ , g _N} denotes the genes involving in IGSN, and E = {e _ij = ≺ g _i, g _j ≻ |sim(g _i, g _j) > t} represents the edges between genes with values greater than threshold t.

Construction of the phenotype-gene bilayer network

We have got three networks, which are phenotype similarity network, IGSN and phenotype-gene association network respectively. In this subsection, we make use of the three networks above to construct a phenotype-gene bilayer network. The construction process of phenotype-gene bilayer network is illustrated in Fig. 4.

Suppose A _P(m × m), B _GP(m × n) and W _G(n × n) are adjacency matrices for phenotype similarity network, phenotype-gene association network and IGSN respectively, where m and n represent the number of phenotypes and genes in their respective networks. The adjacency matrix of the phenotype-gene bilayer network is denoted as

$ A=\left[\begin{array}{cc}\hfill {A}_P\hfill & \hfill {B}_{GP}\hfill \\ {}\hfill {B}_{PG}\hfill & \hfill {W}_G\hfill \end{array}\right] $ (10)

where B _GP is the transpose of B _PG.

Prioritizing candidate disease genes based on RWRB

The RWRB is a ranking algorithm, which simulates a random walker moving from the seed nodes to their immediate neighbors randomly and staying at the current node(s) based on the probability transition matrix [32, 63]. As for a random walk on the bilayer network, we first construct the transition matrix M based on matrix A, which is defined as

$$ M=\left[\begin{array}{cc}\hfill \lambda {M}_P\hfill & \hfill \left(1-\lambda \right){M}_{GP}\hfill \\ {}\hfill \left(1-\lambda \right){M}_{PG}\hfill & \hfill \lambda {M}_G\hfill \end{array}\right] $$

(11)

where M _P, M _GP and M _G are the row-normalizing matrices of A _P, B _PG and W _G respectively; λ controls the jumping probability between two similarity networks, which are phenotype similarity network and IGSN. Then the initial vector P(0) (at t = 0) can be defined as follows:

$$ P(0)=\left[\begin{array}{l}\left(1-\eta \right)u(0)\\ {}\kern1em \eta v(0)\end{array}\right] $$

(12)

where u(0) and v(0) denote the initial probability vector for phenotype similarity network and IGSN. The parameter η ∈ (0, 1) is used to weight the importance of phenotype similarity network and IGSN. The effect of the parameters λ and η on RWRB will be shown in the result section. P(t) represents a vector in which the i-th element holds the probability of finding the random walker on node i at step t.

Based on the vector P(0), P(t) and the transition matrix M, the probability vector at step t + 1 can be given by

$$ P\left(t+1\right)=\left(1-\gamma \right){M}^TP(t)+ rP(0) $$

(13)

where γ ∈ (0, 1) indicates the restart probability. At each step, the random walker has a probability γ to return the seed nodes.

After some steps, the walking process is converged if the change between P(t) and P(t + 1) is lower than 10⁻⁶. The steady probability P(∞) is represented as $ P\left(\infty \right)=\left[\begin{array}{l}\left(1-\eta \right)u\left(\infty \right)\\ {}\kern1em \eta v\left(\infty \right)\end{array}\right] $. As a result, genes which belong to the control set are ranked according to their probability scores in P(∞). Gene which has the maximum in P(∞) among all the control gene set is considered as the most probable gene that associates the phenotype.

Evaluation metrics of prediction performance

Phenotypes in OMIM database mainly have three types [33, 35]: susceptible chromosomal locus and several related disease genes are known; susceptible locus is known, but no related genes are known; locus and related causal genes are unknown, but the phenotype is known. Therefore, we use three leave-one-out cross-validation experiments, i.e. linkage interval, genome-wide scan and ab initio, which are detailedly introduced and used in [35, 43], to validate our method.

Firstly, as for some phenotypes that susceptible chromosomal locus and several related disease genes are known, we take the cross validation against a linkage interval experiment [43]. In each round of validation, one phenotype-gene link is removed. We define the gene associates with the removed link as the held out gene. The phenotype and the rest disease genes related to this phenotype are used as the seed nodes. At the same time, we define the control gene set that consists of the held out disease gene and its 99 nearest genes according to the NCBI refGene location. The performance of RWRB is investigated by the capability to recover the held out disease gene from the control gene set. We call this as linkage interval experiment.

Secondly, since there are some phenotypes that have no susceptible chromosomal locus but have already experimental validated disease genes, we take the validation against genes in the genome-wide scale. In this experiment, we also remove a phenotype-gene relationship and use the rest disease gene associated with this phenotype as the seed nodes. Different to linkage interval experiment, the control gene set consists all the genes in the genome-wide scale except the held out disease gene. The performance of RWRB is investigated by the rank of held out gene in the control gene set. We call this as the genome-wide scan experiment.

Thirdly, as for some phenotypes without any known disease genes and susceptible chromosomal locus, we identify disease genes for these kinds of phenotypes from the whole-genome scale. In this experiment, we first remove all the associations between this phenotype and its disease genes, then run the RWRB algorithm which treats this phenotype as seed node. In this situation, the control gene set is defined as all the genes that in the whole networks. Similar to genome-wide scan experiment, the performance of RWRB is investigated by the rank of held out gene in the control gene set. We call this as ab initio experiment. The detail explanations for the three approaches have been described by Li [35] and Jiang [37].

At the same time, we also define three metrics to investigate the performance of RWRB. First is number of successful predictions (NSP). For each experiment above, in each round of validation, if the held out disease gene is ranked as top 1 among the control gene set, we consider it a successful prediction. Further, for a set of validation runs in each experiment, we sum up the number of successful predictions and treat it as a metric that represents effectiveness of algorithms. Second is the mean rank ratio (MRR), which is defined as the average rank ratios of all held genes in control gene sets in all validation runs. Third is the receiver operation characteristic (ROC) curve. We plots the sensitivity versus 1-specificity which subject to the threshold separating the prediction classes [10]. Sensitivity refers to the percentage of disease genes that are ranked above a particular threshold, while specificity refers to the fraction of control genes rank below the threshold. We vary the threshold from 0.0 to 1.0 with the scale 0.01, and draw the ROC curve. It is well accepted that smaller MRR and larger AUC and NSP values indicate better performance for a prioritization method [43].

Results

First of all, we will investigate the performance of RWRB on three kinds of experiments. Then, we assess the effect of parameters in RWRB algorithm. After that, the proposed algorithm is compared with two similarity-based methods, which are CIPHER [33] and RWRH [35] and two feature-based methods which are PUDI [15] and PriDiGe [14]. Finally, we predict novel causal genes for Alzheimer’s disease and other common diseases based on RWRB algorithm.

The performance of RWRB

In this subsection, we will investigate the performance of RWRB on the three experiments using the three metrics. The detail results are shown in Table 1. The ROC curves on linkage interval and genome-wide scan experiments are shown in Fig. 5.

Table 1 The results of RWRB on the three experiments

Full size table

As is shown in Table 1, the results of RWRB on NSP, MRR and AUC metrics for linkage interval experiment is 1384, 18.28, 0.8505, respectively. Then we further investigate the performance of RWRB on genome-wide scan experiment and obtain a NSP of 311, a MRR of 22.17 and an AUC of 0.8417. In the end, we perform the cross-validation approach against ab initio experiment. The results on NSP, MRR and AUC are 223, 29.64 and 0.8144, respectively.

As is known to us, a random guess will yield a MRR of 50%, and an AUC of 50%, suggesting that the effectiveness of RWRB in uncovering disease gene. Meanwhile, the results also show the reliability of IGSN.

Then we further analyze the detail distribution of disease genes ranked in the control gene set for linkage interval and genome-wide scan experiment. The results are presented in Fig. 6. As for linkage interval experiment, we find that there are 1554 disease genes ranking in top 10, where 1384 disease genes are rank one. 230 disease genes are ranked between 11 and 20, and 157 disease genes are ranked between 30 and 50.

As for genome-wide scan experiment, there are 498 disease genes ranking between 1 and 10. The number of disease genes between 11 and 50 is 422. As we can see from the results, most of held out genes can be rank in top 100. The results on three experiments demonstrate that RWRB has a high accuracy in inferring disease genes on the genomic scale.

Effect of parameters on RWRB

There are totally three parameters in RWRB, which are γ λ and η. The parameter γ denotes the restart probability in Eq. (1). It has been well accepted that the parameter γ has a slight effect on the results and here we fix it at 0.7 [35]. Next, we will investigate the influence of parameter λ and η for RWRB on the NSP metric.

The parameter λ represents the jumping probability between phenotype similarity network and IGSN. According to [35], larger λ will introduce more mutual information between phenotype similarity network and IGSN. To investigate the effect of this parameter on the performance of RWRB, we tested our algorithm on different values of λ ranging from 0.1 to 0.9 with an increment of 0.1.

Results are shown in Table 2. The performance is improved with the increase from 0.1 to 0.6 on the whole. However, the performance is slightly decreased from 0.6 to 0.9. As for the linkage interval experiment, RWRB gets the best performance at λ = 0.6, while RWRB gets the largest NSP at λ = 0.7 on the genome-wide scan experiment. The best results for ab initio experiment is 225 when λ = 0.6. Therefore, we suggest that the best λ value is 0.6 or 0.7 for RWRB on the experiments above. Results demonstrate that the RWRB algorithm successfully makes the best use of the relationships between phenotype similarity network and IGSN.

Table 2 Performances of RWRB at different values of λ on NSP metric

Full size table

As is known to us, η controls the impact of seed phenotypes and seed genes in the initial vector. To validate the effect of parameter η on RWRB, we tested our algorithm on different values of η ranging from 0.1 to 0.9 with the scale 0.1. We run RWRB on linkage interval and genome-wide scan, ab initio experiments, and evaluate its performance on the NSP metric. As is shown in Table 3, the performance is improved with the increase from 0.1 to 0.6 on both experiments. However, the performance is slightly decreased from 0.6 to 0.9. As a result, the algorithm performs best when η at 0.6. This suggests that IGSN is more importance than phenotype similarity network for RWRB.

Table 3 Performances of RWRB at different values of η on NSP metric

Full size table

Comparison with similarity-based methods

We compare RWRB with similarity-based methods which are RWRH [35] and CIPHER [33], respectively. The author [33] defines two topological distance on the basis of two different neighborhood systems: shortest path (SP) and direct neighbor (DN). Therefore, two versions of CIPHER are represented as CIPHER-SP and CIPHER-DN, respectively. The results of each method on NSP metric are presented in Table 4.

Table 4 The performance of each method on the NSP metric

Full size table

Because the number of phenotype-gene associations in RWRB, CIPHER and RWRH models are different, we compute the successful prediction percentages for each method, which is defined as the ratio between NSP and the total phenotype-gene associations in their corresponding datasets. The experimental results are listed in Table 5.

Table 5 The successful prediction percentages for each method

Full size table

As for the linkage interval experiment, RWRB gets 1384 successful predictions, while RWRH, CIPHER-SP and CIPHER-DN obtain 814, 709, 765 successful predictions, respectively. The percentage of successful prediction for RWRB is 0.58 which is the highest in all three methods.

As for the genome-wide scan experiment, the control gene set is defined as the whole genes in IGSN. RWRB get 311 successful predictions, while RWRH, CIPHER-SP and CIPHER-DN obtain 245, 153, 165 suc cessful predictions, respectively. Then number of successful predictions of RWRB is largest in the three methods. However, the percentage of successful predictions for RWRB is 0.13 which is lower than that of RWRH (0.17).

On the ab initio experiment, there are 223 successful predictions by RWRB, while RWRH, CIPHER-SP and CIPHER-DN successfully predicted 201,140 and 157 cases, respectively. However, the percentage of successful predictions of RWRH is the highest in the three methods which is 0.14, whereas the other three methods are almost neck and neck.

Comparison with feature-based methods

At the same time, we compare RWRB with two feature based methods which are PUDI [15] and PriDiGe [14]. Here we only compare the precision (p), recall (r) and F-measure (F) of these three methods, since they are from different type of methods.

The metrics about precision, recall and F-measure for PUDI and ProDiGe have been introduced by Yang [15]. Here, we will also use these metrics to evaluate the performance of RWRB on linkage interval experiment. In the experiment, we take the leave-one-out cross-validation method. For the precision of RWRB, we define it as the ratio between NSP and the number of all validation runs. For the recall of RWRB, we define it as the ratio between the number of the held out genes whose rank proportions are higher than 0.5 and the number of all the held out genes. The F-measure is the harmonic mean of precision and recall, which is defined as F = 2^∗ p ^∗r/(p + r).

The results for PUDI, ProDiGe and RWRB are shown in Table 6. From the results, we can find that RWRB achieves 82.3% recall and ranks first in the three methods. Method PUDI wins 72.3% precision which is 13.7 and 0.9 better than RWRB and ProDiGe method, respectively. At the same time, method PUDI achieves 76.5% F-measure which is 2.0% and 10.2% better than RWRB and ProDiGe method, respectively. In this group experiment, method PUDI performs best and RWRB ranks second overall.

Table 6 Overall comparison among different methods

Full size table

Prioritizing Alzheimer’s disease and other common disease genes by RWRB: A case study

In this subsection, we will use RWRB to predict novel causal genes of interested diseases. To validate the effectiveness of our method, we will check whether our predicted disease genes have been already found to associate with the diseases in literature. Here, we select 16 multifactorial diseases which are used in [37] and list the top 10 candidate genes for each disease. The results are shown in Table 7. Here, we only select Alzheimer’s disease (AD) as the case study to verify the performance of RWRB.

Table 7 Top-10 predicted causal genes of 16 multifactorial diseases

Full size table

AD is a progressive disease that usually starts slowly and gets worse over time. In general, it causes 60% to 70% of cases of dementia. The cause of AD has not been completely understood so far. The primary task is to discover the disease genes to understand the nosogenesis of genetic disease. There are many phenotypes for AD. Here we select 104,300 as target phenotype to prioritize disease gene. The corresponding susceptible region for MIM:104,300 is 6p22.

As is shown in Table 7, the first prediction of RWRB for MIM:104,300 is NOS2, which plays an important role in neuroinflammation by generating nitric oxide (NO), a critical signaling and redox factor in the brain [64]. Further, the levels of NO fall in the brain to a threshold may promote Aβ mediated damage. The predicted gene NOS2 has a large impact on AD. The second prediction gene for MIM:104,300 is NOS1. In the brain and peripheral nervous system, nitric oxide displays many properties for a neurotransmitter. The author [65] suggests that short alleles of the NOS1 exon 1f–VNTR interacting with the epsilon 4 allele tend to markedly increase the AD risk [65]. The fourth predicted gene for AD is APBB1. A trinucleotide deletion of the APBB1 gene was a factor protecting against late-onset AD. Cousin [66] reported the results of a case/control study and confirmed this relationship. The eighth prediction is gene PGBD1. It locates at 6p22 which is the suspectable region of MIM:104,300. What’s more, it currently shows significant association in AlzGene according to Genome-wide association study. Its gene product is specifically expressed in the brain and has been identified as the key factors of AD. The results above show that the combination of the similarity network integration and the identification algorithm can successfully predict candidate genes for interested disease.

Conclusions and discussion

In this paper, we propose a novel method, named RWRB, to infer causal genes of interested diseases. We firstly construct five gene similarity networks based on five different types of genome data. Then we employ SNF method to integrate these gene similarity networks and get IGSN. After that, we perform RWRB to prioritize disease genes. RWRB is compared with the state-of-the-art models and achieves a better performance on most evaluation metrics. Next, we will discuss the highlights of this article.

The advantages of IGSN

The main object of our research is to overcome two drawbacks of current PPI networks, i.e., their low reliability and coverage. As a result, we construct the IGSN in this research. Firstly, since IGSN is fused based on the five gene (protein) similarity networks, its reliability should be higher than existing that of PPI networks. The prioritization of disease genes can be benefited from IGSN. Secondly, IGSN can significantly improve the coverage of human genes comparing current PPI networks. It covers 19,065 genes, which is twice the number of genes in HPRD network. Therefore, the number of phenotype-gene associations in RWRB algorithm is 2386, which is almost twice that in RWRH and CIPHER methods whose number is 1444. As a result, the proposed method can make the best use of phenotype-gene associations in OMIM database. Thirdly, since IGSN is a single network which integrates multiple gene similarity networks, there is no need for it to assign weight values to different subnetworks.

Threshold selection for IGSN

The threshold selection is very important to the quality of IGSN. This is because the threshold affects reliability of IGSN, and may further determine the performance of RWRB. As shown in Fig. 7, the first stop of monotonically increasing of DCC (See legend of Fig. 7) occurs at r = 0.005, which indicates that this threshold is the most appropriate value to construct IGSN. Under this threshold, the IGSN has 19,065 genes in our experiment.

We further investigate the degree distributions of IGSN under the selected threshold. Many previous studies [67] have found that distribution of node connectivity of molecular networks follows a power law. However, some other research [68] argued that there are some distributions, such as the lognormal distribution, which can also depict the degree distribution better than power law. In this research, we employ two models, which are Gaussian distribution and Lognormal distribution, to investigate the distributions of IGSN. In order to increase contrast, we import two other leading PPI networks, which are BioGRID and HPRD networks.

The fitting performance on the distributions for each network is represented by R-squares (R²). R² provides a measure of how well the data fits a certain model. As is shown in Fig. 8, we find that the degree of IGSN fits the lognormal distribution best, while BioGRID and HPRD prefer to fit the power law distribution. As is shown in Fig. 8 (c) and (d), the R² results of IGSN for Gaussian and Lognormal distribution are 0.87 and 0.94, respectively. The R² results of BioGRID and HPRD for fitting Power law are 0.91 and 0.92, respectively, which are shown in Fig. 8 (a) and (b). The degree distribution result shows that IGSN has the characteristics of molecular networks, rather than those of random networks. Therefore, IGSN is a meaningful biological network.

In the future, our research should further be improved from the following aspects. First, other genomic data of genes needs to be integrated. Although we have measured the similarity between genes based on five types of genome data, other information of genes is needed to be integrated to the similarity networks. Second, how to fuse the different similarity networks properly is important to the ultimate integrated network. Many previous studies have attempted to integrate different semantic similarity network and gene expression networks. However, some methods only assign equal weighted to these networks and simply add them together, while some others apply these networks separately. The SNF method used in this article may overcome the drawbacks above. However, the identification of the integrated network is not a trivial assessment because there is no direct way to ascertain its rationality and correctness. In our research, we resort to degree distribution of integrated network and find it fit the lognormal distribution best. This only shows the rationality from one property of the integrated network. Therefore, we need to study more fused methods of network further and make the integrated network be in line with the characteristics of biological networks.

References

Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010;6(1):e1000641.
Article MathSciNet Google Scholar
Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007;25(3):309–16.
Article Google Scholar
Barabasi AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
Article Google Scholar
Mazza A, Klockmeier K, Wanker E, Sharan R. An integer programming framework for inferring disease complexes from network data. Bioinformatics. 2016;32(12):i271–7.
Article Google Scholar
Zhou H, Skolnick J. A knowledge-based approach for predicting gene-disease associations. Bioinformatics. 2016;32(18):2831–8. doi:10.1093/bioinformatics/btw358.
Article Google Scholar
Li Y, Li J. Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data. BMC Genomics. 2012;13(Suppl 7):S27.
Article Google Scholar
Nitsch D, Goncalves JP, Ojeda F, de Moor B, Moreau Y. Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC Bioinformatics. 2010;11:460.
Article Google Scholar
Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003;33(Suppl):228–37.
Article Google Scholar
Jessulat M, Pitre S, Gui Y, Hooshyar M, Omidi K, Samanfar B, Tan LH, Alamgir M, Green J, Dehne F, et al. Recent advances in protein-protein interaction prediction: experimental and computational methods. Expert Opin Drug Discov. 2011;6(9):921–35.
Article Google Scholar
Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, et al. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24(5):537–44.
Article Google Scholar
Antanaviciute A, Daly C, Crinnion LA, Markham AF, Watson CM, Bonthron DT, Carr IM. GeneTIER: prioritization of candidate disease genes using tissue-specific gene expression profiles. Bioinformatics. 2015;31(16):2728–35.
Article Google Scholar
Wang X, Gulbahce N, Yu H. Network-based methods for human disease gene prediction. Briefings Functional Genomics. 2011;10(5):280–93.
Article Google Scholar
Peng J, Bai K, Shang X, Wang G, Xue H, Jin S, Cheng L, Wang Y, Chen J. Predicting disease-related genes using integrated biomedical networks. BMC Genomics. 2017;18(1):1043.
Article Google Scholar
Mordelet F, Vert J-P. ProDiGe: prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics. 2011;12(1):389.
Article Google Scholar
Yang P, Li XL, Mei JP, Kwoh CK, Ng SK. Positive-unlabeled learning for disease gene identification. Bioinformatics. 2012;28(20):2640–7.
Article Google Scholar
Chen Y, Wu XB, Jiang R. Integrating human omics data to prioritize candidate genes. BMC Med Genet. 2013;6:57.
Google Scholar
Dezso Z, Nikolsky Y, Nikolskaya T, Miller J, Cherba D, Webb C, Bugrim A. Identifying disease-specific genes based on their topological significance in protein networks. BMC Syst Biol. 2009;3:36.
Article Google Scholar
Li ZC, Lai YH, Chen LL, Xie Y, Dai Z, Zou XY. Identifying and prioritizing disease-related genes based on the network topological features. Biochim Biophys Acta. 2014;1844(12):2214–21.
Article Google Scholar
Jiang R, Wu M, Li L. Pinpointing disease genes through phenomic and genomic data fusion. BMC Genomics. 2015;16(Suppl 2):S3.
Article Google Scholar
Li J, Lin X, Teng Y, Qi S, Xiao D, Zhang J, Kang Y. A comprehensive evaluation of disease phenotype networks for gene prioritization. PLoS One. 2016;11(7):e0159457.
Article Google Scholar
Chen B, Li M, Wang J, Shang X, Wu F-X. A fast and high performance multiple data integration algorithm for identifying human disease genes. BMC Med Genet. 2015;8(Suppl 3):S2.
Google Scholar
Yuan Q, Gao J, Wu D, Zhang S, Mamitsuka H, Zhu S. DrugE-rank: improving drug-target interaction prediction of new candidate drugs or targets by ensemble learning to rank. Bioinformatics. 2016;32(12):i18–27.
Article Google Scholar
George RA, Liu JY, Feng LL, Bryson-Richardson RJ, Fatkin D, Wouters MA. Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res. 2006;34(19):e130.
Article Google Scholar
Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS. Speeding disease gene discovery by sequence based candidate prioritization. BMC bioinformatics. 2005;6:55.
Article Google Scholar
López-Bigas N, Ouzounis CA. Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res. 2004;32(10):3108–14.
Article Google Scholar
Perez-Iratxeta C, Bork P, Andrade-Navarro MA. Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Res. 2007;35(suppl 2):W212–6.
Article Google Scholar
Schlicker A, Lengauer T, Albrecht M. Improving disease gene prioritization using the semantic similarity of gene ontology terms. Bioinformatics. 2010;26(18):i561–7.
Article Google Scholar
Oti M, Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein-protein interactions. J Med Genet. 2006;43(8):691–8.
Article Google Scholar
Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. American Journal Human Genetics. 2006;78(6):1011–25.
Article Google Scholar
Xu J, Li Y. Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics. 2006;22(22):2800–5.
Article Google Scholar
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9(5):356–69.
Article Google Scholar
Kohler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. American Journal Human Genetics. 2008;82(4):949–58.
Article Google Scholar
Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Mol Syst Biol. 2008;4:189.
Article Google Scholar
Wu X, Liu Q, Jiang R. Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics. 2009;25(1):98–104.
Article Google Scholar
Li Y, Patra JC. Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network. Bioinformatics. 2010;26(9):1219–24.
Article Google Scholar
Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási A-L. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347(6224):1257601.
Article Google Scholar
Zhu J, Qin Y, Liu T, Wang J, Zheng X. Prioritization of candidate disease genes by topological similarity between disease and protein diffusion profiles. BMC Bioinformatics. 2013;14(Suppl 5):S5.
Google Scholar
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL. The human disease network. Proc Natl Acad Sci U S A. 2007;104(21):8685–90.
Article Google Scholar
van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA. A text-mining analysis of the human phenome. Eur J Hum Genet. 2006;14(5):535–42.
Article Google Scholar
Piro RM, Di Cunto F. Computational approaches to disease-gene prediction: rationale, classification and successes. FEBS J. 2012;279(5):678–96.
Article Google Scholar
Vidal M, Cusick ME, Barabasi A-L. Interactome networks and human disease. Cell. 2011;144(6):986–98.
Article Google Scholar
Zhang S-H, Wu C, Li X, Chen X, Jiang W, Gong B-S, Li J, Yan Y-Q. From phenotype to gene: detecting disease-specific gene functional modules via a text-based human disease phenotype network construction. FEBS Lett. 2010;584(16):3635–43.
Article Google Scholar
Jiang R, Gan M, He P. Constructing a gene semantic similarity network for the inference of disease genes. BMC Syst Biol. 2011;5(2):1.
Google Scholar
Gill N, Singh S, Aseri TC. Computational disease gene prioritization: an appraisal. J Comput Biol. 2014;21(6):456–65.
Article MathSciNet Google Scholar
Goncalves JP, Francisco AP, Moreau Y, Madeira SC. Interactogeneous: disease gene prioritization using heterogeneous networks and full topology scores. PLoS One. 2012;7(11):e49634.
Article Google Scholar
Xu Y, Guo M, Liu X, Wang C, Liu Y. Inferring the soybean (Glycine max) microRNA functional network based on target gene network. Bioinformatics. 2014;30(1):94–103.
Article Google Scholar
Yu G, Fu G, Wang J, Zhu H. Predicting protein function via semantic integration of multiple networks. IEEE/ACM Trans Comput Biol Bioinform. 2016;13(2):220–32.
Article Google Scholar
Peng J, Li H, Liu Y, Juan L, Jiang Q, Wang Y, Chen J. InteGO2: a web tool for measuring and visualizing gene semantic similarities using gene ontology. BMC Genomics. 2016;17(Suppl 5):530.
Article Google Scholar
De Las RJ, Fontanillo C. Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010;6(6):e1000807.
Article Google Scholar
Yu G, Zhu H, Domeniconi C, Guo M. Integrating multiple networks for protein function prediction. BMC Syst Biol. 2014;9(1):1–11.
Google Scholar
Peng J, Uygun S, Kim T, Wang Y, Rhee SY, Chen J. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks. BMC Bioinformatics. 2015;16:44.
Article Google Scholar
Chen Y, Wu X, Jiang R. Integrating human omics data to prioritize candidate genes. BMC Med Genet. 2013;6(1):1–12.
Google Scholar
Zhang W, Sun F, Jiang R. Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinformatics. 2011;12(Suppl 1):S11.
Article Google Scholar
Amberger J, Bocchini C, Hamosh A. A new face and new challenges for online Mendelian inheritance in man (OMIM(R)). Hum Mutat. 2011;32(5):564–7.
Article Google Scholar
Consortium U. The universal protein resource (UniProt). Nucleic Acids Res. 2008;36(suppl 1):D190–5.
Google Scholar
Punta M, Coggill P C, Eberhardt R Y, et al. The Pfam protein families database. Nucleic acids research. 2011;40(D1):D290–D301.
Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5(7):e1000443.
Article MathSciNet Google Scholar
Wang JZ, Du Z, Payattakool R, Philip SY, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
Article Google Scholar
Jaccard P: Nouvelles recherches sur la distribution florale; 1908.
Google Scholar
Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibekains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
Article Google Scholar
Le OY, Min W, Zhang XF, Dai DQ, Li XL, Hong Y. A two-layer integration framework for protein complex detection. BMC Bioinformatics. 2016;2016(17):1–14.
Google Scholar
Angione C, Conway M, Lió P. Multiplex methods provide effective integration of multi-omic data in genome-scale models. BMC Bioinformatics. 2016;17(4):257–69.
Google Scholar
Xuan P, Han K, Guo Y, Li J, Li X, Zhong Y, Zhang Z, Ding J. Prediction of potential disease-associated microRNAs based on random walk. Bioinformatics. 2015;31(11):1805–15.
Article Google Scholar
Colton CA, Wilcock DM, Wink DA, Davis J, Van Nostrand WE, Vitek MP. The effects of NOS2 gene deletion on mice expressing mutated human AβPP. J Alzheimers Dis. 2008;15(4):571–87.
Article Google Scholar
Galimberti D, Scarpini E, Venturelli E, Strobel A, Herterich S, Fenoglio C, Guidi I, Scalabrini D, Cortini F, Bresolin N. Association of a NOS1 promoter repeat with Alzheimer's disease. Neurobiology Aging. 2008;29(9):1359–65.
Article Google Scholar
Cousin E, Hannequin D, Ricard S, Macé S, Génin E, Chansac C, Brice A, Dubois B, Frebourg T, Mercken L. A risk for early-onset Alzheimer's disease associated with the APBB1 gene (FE65) intron 13 polymorphism. Neurosci Lett. 2003;342(1):5–8.
Article Google Scholar
Khanin R, Wit E. How scale-free are biological networks. Journal Computational Biology. 2006;13(3):810–8.
Article MathSciNet Google Scholar
Stumpf MPH, Ingram PJ. Probability models for degree distributions of protein interaction networks. Epl. 2005;71(1):152–8.
Article MathSciNet Google Scholar

Download references

Funding

M. Guo is supported by National Natural Science Foundation of China (61,271,346, 61,571,163, and 61,532,014) and the National Key Research and Development Plan Task of China (Grant No. 2016YFC0901902). C. Wang is supported by Natural Science Foundation of China (61402132), and X. Liu is supported by Natural Science Foundation of China (91,335,112, 61,671,189). Publication costs for this article was funded by National Natural Science Foundation of China (61571163).

Availability of data and materials

The dataset(s) supporting the conclusions of this article were downloaded from the relevant public databases.

About this supplement

This article has been published as part of Journal of Biomedical Semantics Volume 8 Supplement 1, 2017: Selected articles from the Biological Ontologies and Knowledge bases workshop. The full contents of the supplement are available online at https://jbiomedsem.biomedcentral.com/articles/supplements/volume-8-supplement-1.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, People’s Republic of China
Zhen Tian, Maozu Guo, Chunyu Wang & LinLin Xing
Institute of Health Service and Medical Information Academy of Military Medical Sciences Beijing, Beijing, 100850, China
Lei Wang & Yin Zhang

Authors

Zhen Tian
View author publications
You can also search for this author in PubMed Google Scholar
Maozu Guo
View author publications
You can also search for this author in PubMed Google Scholar
Chunyu Wang
View author publications
You can also search for this author in PubMed Google Scholar
LinLin Xing
View author publications
You can also search for this author in PubMed Google Scholar
Lei Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yin Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

ZT proposed the idea, implemented the experiments and drafted the manuscript. MG initiated the idea, conceived the whole process and finalized the paper. CW, LX, LW and YZ helped with data analysis and revised the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Maozu Guo.

Ethics declarations

Ethics approval and consent to participate

The human GO annotations are publicly available to all researchers and are free of academic usage fees. There are no ethics issues. No human participants or individual clinical data are involved with this study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Tian, Z., Guo, M., Wang, C. et al. Constructing an integrated gene similarity network for the identification of disease genes. J Biomed Semant 8 (Suppl 1), 32 (2017). https://doi.org/10.1186/s13326-017-0141-1

Download citation

Published: 20 September 2017
DOI: https://doi.org/10.1186/s13326-017-0141-1

Constructing an integrated gene similarity network for the identification of disease genes

Abstract

Background

Results

Conclusions

Background

Methods

Datasets

Phenotype similarity network

Phenotype-gene association network

Gene data

Construction of gene similarity networks based on genomic data of genes

Constructing gene functional similarity networks based on gene ontology

Constructing protein similarity network based on protein sequence

Constructing protein similarity network based on protein domains

Integrating gene similarity networks based on SNF method

Clustering coefficient-based threshold selection

Construction of the phenotype-gene bilayer network

Prioritizing candidate disease genes based on RWRB

Evaluation metrics of prediction performance

Results

The performance of RWRB

Effect of parameters on RWRB

Comparison with similarity-based methods

Comparison with feature-based methods

Prioritizing Alzheimer’s disease and other common disease genes by RWRB: A case study

Conclusions and discussion

The advantages of IGSN

Threshold selection for IGSN

References

Funding

Availability of data and materials

About this supplement

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Journal of Biomedical Semantics

Contact us