In this section, we present a novel method for calculating the semantic similarity between proteins. First, we introduce basic background on the Gene Ontology (GO). Then we describe the details of the graph kernel method.
Gene ontology and gene ontology annotations
The GO project [1] maintains a dynamic, structured, precisely defined, and controlled vocabulary of terms for describing the properties of gene products across species. The GO consists of three ontologies: 1) biological processes (BP), where a process often involves a chemical or physical transformation (e.g. cell growth); 2) molecular functions (MF), defined as the biochemical activities of gene products (e.g. enzymes); and 3) cellular components (CC), which refer to the places in the cell where gene products are active (e.g. nuclear membrane). Each ontology is structured as a directed acyclic graph, where nodes (GO terms) are linked to each other through "is-a", "part-of" or "regulates" relationships. Annotation of gene products, in turn, is the process of assigning ontology terms to gene products in order to describe their activities and localization. For example, the GOA project [2], at the European Bioinformatics Institute (EBI), aims to provide high-quality electronic and manual annotations to UniProt KnowledgeBase (UniProtKB) entries [19]. GOA annotations are obtained through strictly controlled methods, where every association is supported by a distinct evidence source. A protein can be annotated with multiple GO terms from any of the three ontologies in the GO. Functional annotations of UniProtKB proteins currently consist of over 32 million annotations, which cover more than 4 million proteins [2].
Graph representation of proteins
We represented each protein as a subgraph of the ontology consisting of all the GO terms annotating the protein and their ancestors in the ontology. Each edge of the graph corresponds to a relationship between two terms in the ontology. There are three types of relations in the GO: is-a, part-of, and regulates. Since the GO includes three different ontologies, the resulting graph differs depending on which ontology is used. For example, Figure 1 shows the graph generated for UniProtKB protein P17252 using the Cellular Component (CC) ontology.
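The construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy ontology, its term identifiers, and the function name `protein_subgraph` are hypothetical and stand in for a real GO ontology and GOA annotations.

```python
from collections import deque

# Hypothetical toy ontology: child term -> list of (parent term, relation).
# The term names are illustrative placeholders, not real GO identifiers.
TOY_ONTOLOGY = {
    "T:root": [],
    "T:cell": [("T:root", "is-a")],
    "T:membrane": [("T:cell", "part-of")],
    "T:nucleus": [("T:cell", "part-of")],
    "T:nuclear_membrane": [("T:membrane", "is-a"), ("T:nucleus", "part-of")],
    "T:cytoplasm": [("T:cell", "part-of")],
}


def protein_subgraph(annotations, ontology):
    """Return (nodes, edges) for the annotating terms plus all their ancestors."""
    nodes = set()
    edges = set()
    queue = deque(annotations)
    while queue:
        term = queue.popleft()
        if term in nodes:
            continue  # term already expanded via another path in the DAG
        nodes.add(term)
        for parent, relation in ontology[term]:
            edges.add((term, parent, relation))
            queue.append(parent)
    return nodes, edges


# A protein annotated with a single (hypothetical) term.
nodes, edges = protein_subgraph(["T:nuclear_membrane"], TOY_ONTOLOGY)
```

Terms that are neither annotations nor ancestors of an annotation (here `T:cytoplasm`) are left out of the protein's graph.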
A shortest-path graph kernel for proteins
We used a shortest-path graph kernel to compare two graphs, as proposed in [20]. We first define the shortest-path graph. Given a graph $G = (V, E)$, its shortest-path graph is $G_{sp} = (V, E')$, where $E' = \{e'_1, \ldots, e'_l\}$ such that $e'_i = (u, v)$, where $u \in V$, $v \in V$, and $path(u, v) \neq 0$. That is, $G_{sp}$ has the same vertices as $G$, and the edge $(u, v)$ in $G_{sp}$ has the same length as the shortest distance between $u$ and $v$ in $G$. This transformation can be performed using any all-pairs shortest path algorithm. In particular, the Floyd-Warshall algorithm is used in spgk because it is straightforward and has a time complexity of $O(n^3)$, where $n$ is the number of vertices. Then, for a pair of graphs, the shortest-path kernel calculates their similarity by comparing every pair of edges in their shortest-path graphs. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs, and let $G_{1sp} = (V_1, E'_1)$ and $G_{2sp} = (V_2, E'_2)$ be their shortest-path graphs, respectively. The similarity between $G_1$ and $G_2$ can be calculated using Eq. 1:

$$k(G_1, G_2) = \sum_{e_1 \in E'_1} \sum_{e_2 \in E'_2} k_{walk}(e_1, e_2) \qquad (1)$$
where $k_{walk}$ is a positive definite kernel for comparing two walks. In this case, a walk includes an edge and its two end nodes. Let $e_1$ be the edge connecting nodes $v_1$ and $w_1$, and $e_2$ be the edge connecting nodes $v_2$ and $w_2$; then $k_{walk}(e_1, e_2)$ is defined by Eq. 2:

$$k_{walk}(e_1, e_2) = k_{node}(v_1, v_2) \cdot k_{edge}(e_1, e_2) \cdot k_{node}(w_1, w_2) \qquad (2)$$
where $k_{node}$ is a kernel function for comparing two nodes, which returns 1 when the two nodes are identical and 0 otherwise, and $k_{edge}$ is a kernel function for comparing two edges. Here, $k_{edge}$ is a Brownian bridge kernel that returns the largest value when two edges have identical length, and 0 when the edges differ in length by more than a constant $c$, as shown in Eq. 3:

$$k_{edge}(e_1, e_2) = \max\left(0,\; c - |length(e_1) - length(e_2)|\right) \qquad (3)$$

In this study, we use $c = 2$ as suggested by [20].
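The kernel computation described above can be sketched in a few lines. This is an illustrative sketch, not the spgk implementation, and it makes several assumptions that the text does not specify: graphs are treated as undirected with unit edge weights, each graph is a hypothetical `(nodes, edges)` pair, and each shortest-path edge is stored with its end nodes in sorted order so that the two node comparisons reduce to checking that both edges join the same pair of terms. All function names are hypothetical.

```python
import itertools

INF = float("inf")


def floyd_warshall(nodes, edges):
    """All-pairs shortest-path lengths (assumes undirected, unit-weight edges)."""
    nodes = list(nodes)
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in nodes:  # intermediate node must be the outermost loop
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def shortest_path_edges(nodes, edges):
    """Edges of the shortest-path graph as (u, v, length), with u < v."""
    dist = floyd_warshall(nodes, edges)
    return [(u, v, dist[u][v])
            for u, v in itertools.combinations(sorted(nodes), 2)
            if dist[u][v] < INF]


def k_edge(len1, len2, c=2):
    """Brownian bridge kernel on edge lengths: max(0, c - |len1 - len2|)."""
    return max(0, c - abs(len1 - len2))


def sp_kernel(graph1, graph2, c=2):
    """Sum the walk kernel over every pair of shortest-path edges."""
    total = 0
    for v1, w1, l1 in shortest_path_edges(*graph1):
        for v2, w2, l2 in shortest_path_edges(*graph2):
            # k_node is 1 only for identical nodes, so both end nodes must match.
            if v1 == v2 and w1 == w2:
                total += k_edge(l1, l2, c)
    return total


# Two small hypothetical graphs sharing the node labels a, b, c.
g1 = ({"a", "b", "c"}, [("a", "b"), ("b", "c")])
g2 = ({"a", "b", "c", "d"}, [("a", "b"), ("a", "c"), ("c", "d")])
similarity = sp_kernel(g1, g2)  # unnormalized kernel value
```

In this toy case the pair (a, b) has length 1 in both graphs and contributes 2, while (a, c) and (b, c) differ in length by 1 and contribute 1 each, so `similarity` is 4. Pairs involving `d` contribute nothing because `d` does not occur in `g1`.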
Evaluation approach
We evaluated the performance of spgk by comparing the resulting semantic similarities with protein functional similarities derived from expert annotations. Functional similarities between proteins were derived from the Pfam database [21] as described by Couto et al. [13]. Let $P$ denote a protein and $F(P) = \{f_1, f_2, \ldots, f_n\}$ be the set of Pfam families that $P$ is associated with. Then the functional similarity between two proteins $P_i$ and $P_j$ is given by Eq. 4.
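A set-based functional similarity of this kind can be sketched as follows. The sketch assumes the Jaccard index over the family sets $F(P_i)$ and $F(P_j)$; this is an assumption made for illustration, and Eq. 4 remains the authoritative definition. The function name and the family labels are hypothetical placeholders, not real Pfam accessions.

```python
def pfam_similarity(families_i, families_j):
    """Jaccard index of two Pfam family sets (assumed form of the similarity)."""
    f_i, f_j = set(families_i), set(families_j)
    union = f_i | f_j
    if not union:
        return 0.0  # neither protein has any family; treated as dissimilar here
    return len(f_i & f_j) / len(union)


# Placeholder family labels for two hypothetical proteins.
families_p1 = {"FAM_A", "FAM_B", "FAM_C"}
families_p2 = {"FAM_A", "FAM_D"}
sim = pfam_similarity(families_p1, families_p2)  # 1 shared family out of 4 in total
```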
A previous study by Xu et al. [7] showed that having more annotations per protein in the dataset leads to more reliable functional similarity estimation from the GO. Thus, for the purpose of evaluation, we selected from GOA the 100 proteins with the highest numbers of annotations. We also ensured that each selected protein: 1) existed in the UniProtKB/Swiss-Prot database, 2) had at least one annotation from each of the three ontologies in GOA-UniProt, and 3) had at least one Pfam-A annotation. The evaluation proceeded as follows. First, the graph kernel was used to calculate pairwise semantic similarities for the set of proteins. Second, pairwise functional similarities between the proteins were calculated based on the Pfam database annotations. Last, the Pearson correlation coefficient between the semantic and functional similarities was calculated. If two proteins have similar functions, then a good semantic similarity method should detect high semantic similarity between them. Thus, higher values of the Pearson correlation coefficient indicate better performance in the calculation of semantic similarity. This procedure was repeated for each of the three ontologies in the GO, namely, BP, MF, and CC.
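The three evaluation steps can be sketched as below. This is a minimal illustration under stated assumptions: `semantic_sim` and `functional_sim` are hypothetical callables standing in for the graph kernel and the Pfam-based similarity, the protein labels and similarity values are made-up toy data, and the Pearson coefficient is computed by hand to keep the block self-contained.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def evaluate(proteins, semantic_sim, functional_sim):
    """Correlate semantic and functional similarity over all protein pairs."""
    pairs = list(combinations(proteins, 2))
    semantic = [semantic_sim(p, q) for p, q in pairs]
    functional = [functional_sim(p, q) for p, q in pairs]
    return pearson(semantic, functional)


# Toy data: made-up similarity values for three hypothetical proteins.
toy_semantic = {("P1", "P2"): 4, ("P1", "P3"): 1, ("P2", "P3"): 2}
toy_functional = {("P1", "P2"): 0.8, ("P1", "P3"): 0.2, ("P2", "P3"): 0.4}
r = evaluate(
    ["P1", "P2", "P3"],
    lambda p, q: toy_semantic[(p, q)],
    lambda p, q: toy_functional[(p, q)],
)
```

Because the toy functional values are an exact linear function of the toy semantic values, `r` comes out at 1 up to floating-point rounding. In the procedure described above, the same computation would be run once per ontology (BP, MF, and CC).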