Evaluating the effect of annotation size on measures of semantic similarity
© The Author(s) 2017
Received: 15 October 2016
Accepted: 1 February 2017
Published: 13 February 2017
Ontologies are widely used as metadata in biological and biomedical datasets. Measures of semantic similarity utilize ontologies to determine how similar two entities annotated with classes from ontologies are, and semantic similarity is increasingly applied in applications ranging from diagnosis of disease to investigation in gene networks and functions of gene products.
Here, we analyze a large number of semantic similarity measures and the sensitivity of similarity values to the number of annotations of entities, difference in annotation size and to the depth or specificity of annotation classes. We find that most similarity measures are sensitive to the number of annotations of entities, difference in annotation size as well as to the depth of annotation classes; well-studied and richly annotated entities will usually show higher similarity than entities with only few annotations even in the absence of any biological relation.
Our findings may have significant impact on the interpretation of results that rely on measures of semantic similarity, and we demonstrate how the sensitivity to annotation size can lead to a bias when using semantic similarity to predict protein-protein interactions.
KeywordsSemantic similarity Ontology Gene ontology
Semantic similarity measures are widely used for datamining in biology and biomedicine to compare entities or groups of entities in ontologies [1, 2], and a large number of similarity measures has been developed . The similarity measures are based on information contained in ontologies combined with statistical properties of a corpus that is analyzed . There are a variety of uses for semantic similarity measures in bioinformatics, including classification of chemicals , identifying interacting proteins , finding candidate genes for a disease , or diagnosing patients .
With the increasing use of semantic similarity measures in biology, and the large number of measures that have been developed, it is important to identify a method to select an adequate similarity measure for a particular purpose. In the past, several studies have been performed that evaluate semantic similarity measures with respect to their performance on a particular task such as predicting protein-protein interactions through measures of function similarity [8–10]. While such studies can provide insights into the performance of semantic similarity measures for particular use cases, they do not serve to identify the general properties of a similarity measure, and the dataset to be analyzed, based on which the suitability of a semantic similarity measure can be determined. Specifically, when using semantic measures, it is often useful to know how the annotation size of an entity affects the resulting similarity, in particular when the corpus to which the similarity measure is applied has a high variance in the number of annotations. For example, some semantic similarity measures may always result in higher similarity values when the entities that are compared have more annotations and may therefore be more suitable to compare entities with the same number of annotations. Furthermore, the difference in annotation size can have a significant effect on the similarity measure so that comparing entities with the same number of annotations may always lead to higher (or lower) similarity values than comparing entities with a different number in annotations.
Here, we investigate features of a corpus such as the number of annotations to an entity and the variance (or difference) in annotation size on the similarity measures using a large number of similarity measures implemented in the Semantic Measures Library (SML) . We find that different semantic similarity measures respond differently to annotation size, leading to higher or lower semantic similarity values with increasing number of annotations. Furthermore, the difference in the number of annotations affects the similarity values as well. Our results have an impact on the interpretation of studies that use semantic similarity measures, and we demonstrate that some biological results may be biased due to the choice of the similarity measure. In particular, we show that the application of semantic similarity measures for predicting protein-protein interactions can result in a bias, similarly to other ‘guilt-by-association’ approaches , in which the sensitivity of the similarity measure to the annotation size confirms a bias present in protein-protein interaction networks so that well-connected and well-annotated proteins have, on average, a higher similarity by chance than proteins that are less well studied.
Generation of test data
We perform all our experiments using the Gene Ontology (GO) , downloaded on 22 December 2015 from http://geneontology.org/page/download-ontology and Human Phenotype Ontology (HPO) , download on 1 April 2016 from http://human-phenotype-ontology.github.io/downloads.html in OBO Flatfile Format. The version of GO we use consists of 44,048 classes (of which 1941 are obsolete) and HPO consists of 11,785 classes (of which 112 are obsolete). We run our experiments on several different sets of entities annotated with different number of GO or HPO classes and one set of entities annotated with GO classes from specific depth of the graph structure. The first set contains 5500 entities and we randomly annotated 100 entities each with 1,2,…,54,55 GO classes. We generate our second set of entities annotated with HPO classes in the same fashion. The third set is a set of manually curated gene annotations from the yeast genome database file (gene_associations.sgd.gz) downloaded on 26 March 2016 from http://www.yeastgenome.org/download-data/curation. The dataset consists of 6108 genes with annotations sizes varying from 1 to 55, and each group of the same size contains a different number of gene products. We ignore annotations with GO evidence code ND (No Data). The fourth set contains 1700 entities which is composed of 17 groups. Each group have 100 randomly annotated entities with GO classes from the same depth of the ontology graph structure.
Computing semantic similarity
After the random annotations were assigned to the entities, we computed the semantic similarity between each pair of entities using a large set of semantic similarity measures. We include both groupwise measures and pairwise measures with different strategies of combining them . Groupwise similarity measures determine similarity directly for two sets of classes. On the other hand, indirect similarity measures first compute the pairwise similarities for all pairs of nodes and then apply a strategy for computing the overall similarity. Strategies for the latter include computing the mean of all pairwise similarities, computing the Best Match Average, and others .
Furthermore, most semantic similarity measures rely on assigning a weight to each class in the ontology that measures the specificity of that class. We performed our experiments using an intrinsic information content measure (i.e., a measure that relies only on the structure of the ontology, not on the distribution of annotations) introduced by .
The semantic similarity measures we evaluated include the complete set of measures available in the Semantic Measures Library (SML) , and the full set of measures can be found at http://www.semantic-measures-library.org. The SML reduces an ontology to a graph structure in which nodes represent classes and edges in the graph represent axioms that hold between these classes [16, 17]. The similarity measures are then defined either between nodes of this graph or between subgraphs.
The raw data and evaluation results for all similarity measures are available as Additional file 1: Table S1. The source code for all experiments is available on GitHub at https://github.com/bio-ontology-research-group/pgsim.
In order to measure the sensitivity of the similarity measures to the number of annotations we calculated Spearman and Pearson correlation coefficients between set of annotations sizes and the set of average similarity of one size group to all the others. In other words, we first computed the average similarities for each entity in a group with fixed annotation size and computed the average similarity to all entities in our corpus. For calculating the correlation coefficients we used SciPy library .
We evaluate our results using protein-protein interaction data from BioGRID  for yeast, downloaded on 26 March 2016 from http://downloads.yeastgenome.org/curation/literature/interaction_data.tab. The file contains 340,350 interactions for 9868 unique genes. We filtered these interactions using the set of 6108 genes from the yeast genome database and our final interaction dataset includes 224,997 interactions with 5804 unique genes. Then we compute similarities between each pair of genes using simGIC measure  and Resnik’s similarity measure  combined with Average and Best Match Average (BMA) strategies and generate similarity matrices. Additionally, we create a dataset with random GO annotations for the same number of genes, and the same number of annotations for each gene. We also generate the similarity matrices for this set using the same similarity measures. To evaluate our results, we use the similarity values as a prediction score, and compute the receiver operating characteristic (ROC) curves (i.e., a plot of true positive rate as function of false positive rate)  for each similarity measure by treating pairs of genes that have a known PPI as positive and all other pairs of proteins as negatives.
In order to determine if our results are valid for protein-protein interaction data from other organisms, we perform a similar evaluation with mouse and human interactions. We downloaded manually curated gene function annotations from http://www.geneontology.org/gene-associations/ for mouse (gene_associations.mgi.gz) and human (gene_associations.goa_human.gz) on 12 November 2016. The mouse annotations contain 19,256 genes with annotations size varying from 1 to 252 and human annotations contain 19,256 genes with annotations size varying from 1 to 213. We generate random annotations with the same annotations sizes for both datasets and compute similarity values using Resnik’s similarity measure combined with BMA strategy. For predicting protein-protein interactions we use BioGRID interactions downloaded on 16 November 2016 from https://thebiogrid.org/download.php. There are 38,513 gene interactions for mouse and 329,833 interactions for human.
To evaluate our results with differnt ontologies, we aim to predict gene–disease associations using phenotypic similarity between genes and diseases. We use mouse phenotype annotations and mouse gene–disease associations downloaded from http://www.informatics.jax.org/downloads/reports/index.html(MGI_PhenoGenoMP.rpt and MGI_Geno_Disease.rpt). The dataset contains 18,378 genes annotated with Mammalian Phenotype Ontology (MPO)  classes with size varying from 1 to 1671, and 1424 of genes have 1770 associations with 1302 Mendelian diseases. We downloaded Mendelian disease phenotype annotations from http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastStableBuild/ and generated random annotations with the same sizes for both gene and disease annotation datasets. We computed similarity of each gene to each disease by computing the Resnik’s similarity measure combined with BMA strategy between sets of MPO terms and HPO terms based on PhenomeNET Ontology . Using this similarity value as a prediction score we computed ROC curves for real and random annotations.
Results and discussion
Our aim is to test three main hypothesis. First, we evaluate whether the annotation size has an effect on similarity measures, and quantify that effect using measures of correlation and statistics. We further evaluate whether annotation size has an effect on the variance of similarity values. Second, we evaluate whether the difference in the number of annotations between the entities that are compared has an effect on the similarity measure, and quantify the effects through measures of correlation. Third, we evaluate whether the depth of the annotation classes has an effect on similarity measures. Finally, we classify semantic similarity measures in different categories based on how they behave with respect to annotation size, differences in annotation size and depth of annotation classes, using the correlation coefficients between similarity value.
Spearman and Pearson correlation coefficients between similarity value and absolute annotation size as well as between variance in similarity value and annotation size
GIC (Graph Information Content)
NTO (Normalized Term Overlap)
UI (Union Intersection)
BMA with Jiang, Conrath 1997
BMA with Lin 1998
BMA with Resnik 1995
BMA with Schlicker 2006
To determine whether the results we obtain also hold for a real biological dataset, we further evaluated the semantic similarity between yeast proteins using a set of selected semantic similarity measures. We find that the results in our test corpus are also valid for the semantic similarly of yeast proteins. Figure 1 shows the average similarity of yeast proteins as a function of the annotation size for two semantic similarity measures.
For example, the protein YGR237C has only a single annotation, and the average similarly, using the simGIC measure, is 0.035 across the set of all yeast proteins. On the other hand, protein CDC28, a more richly annotated protein with 55 annotations, has as average similarly 0.142 (more than 4-fold increase). These results suggest that some entities have, on average and while comparing similarity to exactly the same set of entities, higher similarity, proportional to the number of annotations they have.
Spearman and Pearson correlation coefficients between similarity value and difference in annotation size as well as between variance in similarity value and difference in annotation size
GIC (Graph Information Content)
NTO (Normalized Term Overlap)
UI (Union Intersection)
BMA with Jiang, Conrath 1997
BMA with Lin 1998
BMA with Resnik 1995
BMA with Schlicker 2006
In our third experiment, we evaluate whether the depth of the annotation classes has an effect on the similarity measure. We use our fourth dataset which we randomly generated based on the depth of classes in the GO. The maximum depth in GO is 17, and we generate 17 groups of random annotations. We then compute the average similarity of the synthetic entities within one group to all the other groups, and report Pearsson and Spearman correlation coefficients between annotation class depth and average similarities to determine the sensitivity of the similarity to annotation class depth. Figure 1 shows our results using synthetic data as well as functional annotations of yeast proteins for Resnik’s similarity measure (using the Best Match Average strategy) and the simGIC measure, and Table 2 summarizes the results. We find that for most measures, average similarity increases with the depth of the annotations, i.e., the more specific a class is the higher the average similarity to other classes.
A classification of similarity measures
Our finding allows us to broadly group semantic similarity measures into groups depending on their sensitivity to annotation size and difference in annotation size. We distinguish positive correlation (Pearsson correlation >0.5), no correlation (Pearsson correlation between −0.5 and 0.5), and negative correlation (Pearsson correlation <0.5), and classify the semantic similarity measures based on whether they are correlated with annotation size, difference in annotation size, and depth. Additional file 1: Table S1 provides a comprehensive summary of our results.
By far the largest group of similarity measures has a positive correlation between annotation size and similarity value, and a negative correlation between variance and annotation size. Popular similarity measures such as Resnik’s measure  with the Best Match Average combination strategy, and the simGIC similarity measure , fall in this group. A second group of similarity measures has no, or only small, correlation between annotation size and similarity values, and might therefore be better suited to compare entities with a large variance in annotation sizes. The Normalized Term Overlap (NTO) measure  falls into this group. Finally, a third group results in lower similarity values with increasing annotation size.
Impact on data analysis
In order to test our results on an established biological use case involving computation of semantic similarity, we conducted an experiment by predicting protein-protein interactions using the similarity measures. Prediction of protein-protein interactions is often used to evaluate and test semantic similarity measures [8–10], but similar methods and underlying hypotheses are also used for candidate gene prioritization  in guilt-by-association approaches .
The good performance in predicting PPIs in the absence of biological information is rather surprising. We hypothesized that well-studied proteins generally have more known functions and more known interactions, and also that genes involved in several diseases have more phenotype annotations. The Pearson correlation coefficient between the number of interactions and the number of functions in our yeast dataset is 0.34, in the human dataset 0.23, and 0.36 in the mouse PPI dataset. Similarly, in our dataset of gene–disease associations, there is a correlation between the number of phenotype annotations and the number of gene–disease associations (0.42 Pearson correlation coefficient). While the correlations are relatively small, there is nevertheless a bias that is confirmed by selecting a similarity measure that follows the same bias. We tested whether the same phenomenon occurs with another similarity measure that is not sensitive to the annotation size or difference in annotation size. Using Resnik’s measure with the Average strategy for combining the similarity values, we obtain a ROC AUC of 0.52 when predicting yeast PPIs. Although this ROC AUC is still significantly better than random (p≤10−6, one-sided Wilcoxon signed rank test), the effect is much lower compared to other measures.
In the context of gene networks, prior research has shown that the amount of functional annotation and network connectivity may result in biased results for certain types of analyses, leading the authors to conclude that the “guilt by association” principle holds only in exceptional cases . Our analysis suggests that similar biases may be introduced in applications of semantic similarity measures such that heavily annotated entities will have, on average and without the presence of any biological relation between entities, a higher similarity to other entities than entities with only few annotations. A similar but inverse effect exists for differences in annotation size. Consequently, comparing entities with many annotations (e.g., well-studied gene products or diseases) to entities with few annotations (e.g., novel or not well-studied gene products) will result, on average, in the lowest similarity values, while comparing well-studied entities to other well-studied entities (both with high annotation size and no or only small differences in annotation size) will result in higher average similarity for most similarity measures even in the absence of any biological relation.
We find that the annotation size of entities clearly plays a role when comparing entities through measures of semantic similarity, and additionally that the difference in annotation size also plays a role. This has an impact on the interpretation of semantic similarity values in several applications that use semantic similarity as a proxy for biological similarity, and the applications include prioritizing candidate genes , validating text mining results , or identifying interacting proteins . Similarly to a previous study on protein-protein interaction networks , we demonstrate that the sensitivity of similarity measures to annotation size can lead to a bias when predicting protein-protein interactions. These results should be taken into account when interpreting semantic similarity values.
In the future, methods need to be identified to correct for the effects of annotation size and difference in annotation size. Adding richer axioms to ontologies or employing similarity measures that can utilize axioms such as disjointness between classes  does not on its own suffice to remove the bias we identify, mainly because the relation between annotated entities (genes or gene products) and the classes in the ontologies does not consider disjointness axioms. It is very common for a gene product to be annotated to two disjoint GO classes, because one gene product may be involved in multiple functions (such as “vocalization behavior” and “transcription factor activity”) since gene products are not instances of GO classes but rather are related by a has function relation (or similar) to some instance of the GO class. A possible approach could be to rely on the exact distribution of similarity values for individual entities  and use a statistical tests to determine the significance of an observed similarity value. An alternative strategy could rely on expected similarity values based on the distribution of annotations in the corpus and the structure of the ontology and adjusting similarity values accordingly so that only increase over expected similarity values are taken into consideration.
Area under curve
Best match average
Human phenotype ontology
Normalized term overlap
Receiver operating characteristic
Semantic measures library
This research was supported by funding from the King Abdullah University of Science and Technology.
Availability of data and materials
All source code developed for this study is available from https://github.com/bio-ontology-research-group/pgsim.
RH conveived of the study, MK performed the experiments and evaluation, all authors interpreted the results and wrote the manuscript. Both authors have read and approved the final version of the manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009; 5(7):1000443.MathSciNetView ArticleGoogle Scholar
- Hoehndorf R, Gkoutos GV, Schofield PN. In: Carugo O, Eisenhaber F, (eds).Datamining with Ontologies. New York: Springer; 2016, pp. 385–97.Google Scholar
- Harispe S, et al. Semantic similarity from natural language and ontology analysis. Synth Lect Hum Lang Technol. 2015; 8(1):1–254.View ArticleGoogle Scholar
- Ferreira JD, Couto FM. Semantic similarity for automatic classification of chemical compounds. PLoS Comput Biol. 2010; 6(9):1000937. doi:10.1371/journal.pcbi.1000937.View ArticleGoogle Scholar
- Chiang JH, Ju JH. Discovering novel protein–protein interactions by measuring the protein semantic similarity from the biomedical literature. J Bioinform Comput Biol. 2014; 12(06):1442008.View ArticleGoogle Scholar
- Hoehndorf R, Schofield PN, Gkoutos GV. Phenomenet: a whole-phenome approach to disease gene discovery. Nucleic Acids Res. 2011; 39(18):119–9.View ArticleGoogle Scholar
- Köhler S, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet. 2009; 85(4):457–64.View ArticleGoogle Scholar
- Couto FM, Silva MJ, Coutinho PM. Finding genomic ontology terms in text using evidence content. BMC Bioinforma. 2005; (6 Suppl 1):S21.View ArticleGoogle Scholar
- Wu X, et al. Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations. Nucleic Acids Res. 2006; 34(7):2137–50.View ArticleGoogle Scholar
- Xu T, Du L, Zhou Y. Evaluation of go-based functional similarity measures using s. cerevisiae protein interaction and expression profile data. BMC Bioinforma. 2008; 9(1):1–10.View ArticleGoogle Scholar
- Harispe S, et al. The semantic measures library and toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies. Bioinformatics. 2014; 30(5):740–2.View ArticleGoogle Scholar
- Gillis J, Pavlidis P. “Guilt by Association” is the exception rather than the rule in gene networks. PLoS Comput Biol. 2012; 8(3):1002444.View ArticleGoogle Scholar
- Ashburner M, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9.View ArticleGoogle Scholar
- Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GCM, Brown DL, Brudno M, Campbell J, FitzPatrick DR, Eppig JT, Jackson AP, Freson K, Girdea M, Helbig I, Hurst JA, Jähn J, Jackson LG, Kelly AM, Ledbetter DH, Mansour S, Martin CL, Moss C, Mumford A, Ouwehand WH, Park SM, Riggs ER, Scott RH, Sisodiya S, Vooren SV, Wapner RJ, Wilkie AOM, Wright CF, Vulto-van Silfhout AT, Leeuw Nd, de Vries BBA, Washingthon NL, Smith CL, Westerfield M, Schofield P, Ruef BJ, Gkoutos GV, Haendel M, Smedley D, Lewis SE, Robinson PN. The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014; 42(D1):966–74.View ArticleGoogle Scholar
- Batet M, Sánchez D, Valls A. An ontology-based measure to compute semantic similarity in biomedicine. J Biomed Inform. 2011; 44(1):118–25. Ontologies for Clinical and Translational Research.View ArticleGoogle Scholar
- Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biol. 2005; 6(5):46.View ArticleGoogle Scholar
- Hoehndorf R, Oellrich A, Dumontier M, Kelso J, Rebholz-Schuhmann D, Herre H. Relations as patterns: Bridging the gap between OBO and OWL. BMC Bioinforma. 2010; 11(1):441.View ArticleGoogle Scholar
- Jones E, Oliphant T, Peterson P, et al. SciPy: Open source scientific tools for Python. 2001. http://www.scipy.org/. Accessed 2016-04-06.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(suppl 1):535–9. doi:10.1093/nar/gkj109. http://nar.oxfordjournals.org/content/34/suppl_1/D535.full.pdf+html.View ArticleGoogle Scholar
- Resnik P. Semantic similarity in a taxonomy: An Information-Based measure and its application to problems of ambiguity in natural language. J Artif Intell Res. 1999; 11:95–130.MATHGoogle Scholar
- Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006; 27(8):861–74. doi:10.1016/j.patrec.2005.10.010.MathSciNetView ArticleGoogle Scholar
- Bello SM, Richardson JE, Davis AP, Wiegers TC, Mattingly CJ, Dolan ME, Smith CL, Blake JA, Eppig JT. Disease model curation improvements at mouse genome informatics. Database. 2012; 2012:63.View ArticleGoogle Scholar
- Pesquita C, Faria D, Bastos H, Ferreira AEN, Falcão AO, Couto FM. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinforma. 2008; (9 Suppl 5):S4. doi:10.1186/1471-2105-9-S5-S4.View ArticleGoogle Scholar
- Mistry M, Pavlidis P. Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinforma. 2008; 9(1):1–11.View ArticleGoogle Scholar
- Schlicker A, Lengauer T, Albrecht M. Improving disease gene prioritization using the semantic similarity of gene ontology terms. Bioinformatics. 2010; 26(18):561–7.View ArticleGoogle Scholar
- Hoehndorf R, Schofield PN, Gkoutos GV. An integrative, translational approach to understanding rare and orphan genetically based diseases. Interface Focus. 2013; 3(2):20120055. doi:10.1098/rsfs.2012.0055.View ArticleGoogle Scholar
- Lamurias A, Ferreira JD, Couto FM. Improving chemical entity recognition through h-index based semantic similarity. J Cheminformatics. 2015; 7(1):13. doi:10.1186/1758-2946-7-S1-S13.View ArticleGoogle Scholar
- Ferreira JD, Hastings J, Couto FM. Exploiting disjointness axioms to improve semantic similarity measures. Bioinformatics. 2013; 29(21):2781–787. doi:10.1093/bioinformatics/btt491. http://bioinformatics.oxfordjournals.org/content/29/21/2781.full.pdf+html.View ArticleGoogle Scholar
- Schulz MH, Köhler S, Bauer S, Robinson PN. Exact score distribution computation for ontological similarity searches. BMC Bioinforma. 2011; 12(1):441. doi:10.1186/1471-2105-12-441.View ArticleGoogle Scholar