Tackling the challenges of matching biomedical ontologies

Faria, Daniel; Pesquita, Catia; Mott, Isabela; Martins, Catarina; Couto, Francisco M.; Cruz, Isabel F.

doi:10.1186/s13326-017-0170-9

Published: 15 January 2018

Daniel Faria ORCID: orcid.org/0000-0003-1511-277X¹,
Catia Pesquita²,
Isabela Mott²,
Catarina Martins³,
Francisco M. Couto² &
Isabel F. Cruz⁴

Journal of Biomedical Semantics volume 9, Article number: 4 (2018)

Abstract

Background

Biomedical ontologies pose several challenges to ontology matching due both to the complexity of the biomedical domain and to the characteristics of the ontologies themselves. The biomedical tracks in the Ontology Alignment Evaluation Initiative (OAEI) have spurred the development of matching systems able to tackle these challenges, and benchmarked their general performance. In this study, we dissect the strategies employed by matching systems to tackle the challenges of matching biomedical ontologies and gauge the impact of the challenges themselves on matching performance, using the AgreementMakerLight (AML) system as the platform for this study.

Results

We demonstrate that the linear complexity of the hash-based searching strategy implemented by most state-of-the-art ontology matching systems is essential for matching large biomedical ontologies efficiently. We show that accounting for all lexical annotations (e.g., labels and synonyms) in biomedical ontologies leads to a substantial improvement in F-measure over using only the primary name, and that accounting for the reliability of different types of annotations generally also leads to a marked improvement. Finally, we show that cross-references are a reliable source of information and that, when using biomedical ontologies as background knowledge, it is generally more reliable to use them as mediators than to perform lexical expansion.

Conclusions

We anticipate that translating traditional matching algorithms to the hash-based searching paradigm will be a critical direction for the future development of the field. Improving the evaluation carried out in the biomedical tracks of the OAEI will also be important, as without proper reference alignments there is only so much that can be ascertained about matching systems or strategies. Nevertheless, it is clear that, to tackle the various challenges posed by biomedical ontologies, ontology matching systems must be able to efficiently combine multiple strategies into a mature matching approach.

Background

The biomedical domain presents a strong case for the application of ontology matching, as there are hundreds of biomedical ontologies which were mostly developed independently, and many of them cover overlapping domains [1]. Establishing meaningful links between such ontologies is critical to ensure interoperability and has the potential to unlock biomedical knowledge by bridging siloed data. However, biomedical ontologies present some of the most significant challenges to the field of ontology matching, given their characteristics and the complexity of the domain they cover.

The first hurdle that ontology matching systems must overcome to match biomedical ontologies is their large size. Many of the most widely used biomedical ontologies have tens of thousands of classes (e.g., the Gene Ontology, the Uber Anatomy Ontology) or even hundreds of thousands (e.g., the SNOMED Clinical Terms, the Chemical Entities of Biological Interest Ontology). Handling such large ontologies presents computational challenges throughout the ontology matching pipeline. Matching systems must first be able to load the ontologies in a memory efficient manner, then circumvent the quadratic complexity of the matching problem, and finally be able to effectively select the final set of mappings from a potentially large universe of plausible mapping candidates. Without tackling these challenges, ontology matching systems cannot match large biomedical ontologies in practice.

Another challenge in matching biomedical ontologies is the rich and complex vocabulary of the biomedical domain. Biomedical ontologies possess a rich lexical component, with each class being typically described by several annotations such as labels and different types of synonyms (e.g., exact, broad, narrow, related). For example, the Uber Anatomy Ontology (UBERON) class UBERON_0000948 has amongst its annotations: label “heart”, exact synonyms “vertebrate heart” and “chambered heart”, narrow synonym “branchial heart”, related synonym “cardium”, and even relational adjective “cardiac” [2]. Taking into account this lexical complexity is critical for matching biomedical ontologies effectively. On the one hand, matching systems must make use of all these annotations to obtain a reasonable recall, as what is a label in one ontology may be a synonym in another. On the other hand, systems must be able to account for the different specificity of the various types of annotations to successfully navigate through the many cases of homonymy, paronymy, and overlapping words, and thus attain a high precision. For example, the Foundational Model of Anatomy (FMA) class 59762 has label “gingiva” and exact synonym “gum” [3], but the word gum also has different meanings in the biomedical context. It can refer to a dietary gum, as is the case of the National Cancer Institute Thesaurus (NCI) class C68500 which has “gum” as exact synonym [4], and also to a type of drug preparation, as is the case of the SNOMED Clinical Terms (SNOMED) class 426210003, which has “gum” as a label [5]. However, the last two ontologies also have a class with label “gingiva” (and synonym “gum”), so this case illustrates how valuing label-to-label mappings over mappings involving synonyms would enable matching systems to find the correct mappings and avoid the incorrect ones when matching either of these two ontologies with FMA.

It is common across all domains that different ontologies have different modeling views of a given domain. However, the complexity of the biomedical domain makes this particularly challenging. Biomedical ontologies on the same domain can have profound differences in organization to the point that they are logically irreconcilable due to conflicting restrictions. For instance, in NCI, anatomical structures and proteins are modeled as disjoint, and consequently the fibrillar or filamentous form of actin (class C32581) which is an anatomical structure is disjoint with the actin protein (class C16258). By contrast, in FMA, proteins are modeled as anatomical structures, and fibrillar actin (class 67844) is actually a subclass of the actin protein (class 67843). Thus, while it would be biologically correct to map the classes describing the fibrillar form of actin and the classes describing the actin protein of each ontology, doing so would cause logical conflicts if the two ontologies were integrated in this way. This example illustrates the trade-off between completeness and logical soundness that often must be considered when matching biomedical ontologies [6].

The simple but particular semantics of biomedical ontologies is another aspect that differentiates them from the ontologies of other domains. Most biomedical ontologies have few properties and relatively simple semantics; for instance, half of the ontologies in BioPortal fit into the tractable OWL 2 EL profile [7]. This fact, together with the lexical richness and frequent modeling differences of biomedical ontologies, means that strategies for matching these ontologies tend to rely primarily on lexical matching algorithms, with structural matching algorithms taking a secondary role, if employed at all. However, when they are employed, structural matching algorithms should take into consideration an object property of particular importance in biomedical ontologies, “part of”, which often forms a second hierarchical backbone (a partonomy) that complements the taxonomic backbone defined by subclass relations.

While the specialized biomedical vocabulary may render general purpose lexical tools such as WordNet ineffective, the increasing profusion of biomedical ontologies means that there are usually abundant sources of background knowledge available to ontology matching systems in the form of external related ontologies. The challenge lies in identifying the most suitable and useful sources of background knowledge among potentially hundreds of candidate ontologies. Addressing this challenge has been the topic of several studies which proposed metrics for estimating the usefulness of background ontologies [8, 9]. Of particular relevance as sources of background knowledge are the efforts of the OBO Foundry to include external references in their ontologies [10]. There are two forms of these: direct cross-references to other ontologies, and logical definitions that correspond to composite references to two or more other ontologies. Both are manually curated, high-quality knowledge sources that can be reused as background knowledge by ontology matching systems.

The relevance of the biomedical domain for ontology matching and the interesting challenges it raises have motivated the inclusion of a growing number of biomedical ontology alignment tasks in the Ontology Alignment Evaluation Initiative (OAEI). These tracks have played a key role in driving forward the development of systems and strategies to tackle many of the challenges of matching biomedical ontologies. While the OAEI has done an excellent job at evaluating ontology matching at the system level, assessing the contributions of the various strategies implemented by each system is beyond its scope, as each system is different and the level to which systems can be broken down into individual strategies varies.

In the interest of providing a more in-depth evaluation, in this study we dissect the strategies employed by matching systems to tackle the aforementioned challenges of matching biomedical ontologies and gauge the impact of the challenges themselves on matching performance. We use AgreementMakerLight (AML) [11] as a platform for the study, as it meets three critical criteria: it is one of the top performing systems in the biomedical tracks of the OAEI [12] and thus represents the state of the art; it was designed specifically for matching biomedical ontologies and thus to tackle most of the challenges involved therein; and it has a modular architecture, which is essential to enable the type of analysis we aim to conduct in this study. It is also the matching system with which we are most familiar, thus facilitating our work.

The rest of the manuscript is organized as follows: in the “Related work” section we review how matching systems participating in the OAEI have tackled the challenges of matching biomedical ontologies; in the Methods, we provide a brief overview of AML, make an in-depth analysis of the strategies AML and other top-performing matching systems employ to tackle biomedical ontologies, and describe the datasets and experimental setting; in the Results and Discussion we dissect the impact of several of the strategies implemented by AML on its effectiveness and efficiency; and finally, in the Conclusions, we provide an overarching view of the study and ponder on the aspects where the state of the art in matching biomedical ontologies can be improved.

Related work

Throughout the history of the OAEI, a number of biomedical ontology alignment tasks have been introduced and multiple matching systems have participated in them. The Anatomy track was introduced in the first OAEI proper, in 2005. The Large Biomedical Ontologies track was introduced in OAEI 2011.5 with two tasks, and expanded to six tasks in OAEI 2012. More recently, the Disease and Phenotype track was introduced in 2016 with two tasks, and expanded to four tasks in 2017. Of the many systems that have participated in one or more editions of one of these tracks, 27 have a peer-reviewed publication and thus can be reviewed with respect to how they address the challenges outlined in the previous section. Table 1 summarizes this information.

Table 1 Overview of the ontology matching systems that participated in OAEI biomedical tracks

Handling the largest biomedical ontologies continues to prove a tough challenge for ontology matching systems. Of the surveyed systems, only 5 have been able to complete the largest tasks of the Large Biomedical Ontologies track at some point in their history. In 2016, only AML and LogMap [13] did so out of 8 independent participants [14]. Although this is not often detailed in publications, most systems that are able to tackle large ontologies make use of data structures with inverted indices that enable hash-based searching rather than pairwise matching, and thus circumvent the quadratic nature of ontology matching.

The lexical complexity of biomedical ontologies is also an aspect which few ontology matching systems are prepared to tackle. Most systems do make use of WordNet [15], but this is a general purpose English lexical tool, so while it may enable systems to find some lexical variants, its coverage of the specialized biomedical vocabulary is far from comprehensive. Thus, a few matching systems such as FCA-Map [16] and LogMap [13] opt for the domain-specific UMLS SPECIALIST Lexicon [17]. The diversity of ontology annotations contemplated by each matching system is unclear, but the use of weights for synonyms is an uncommon feature, and a differentiating weighting scheme to reflect the precision of different types of lexical annotations has only been reported in AML [18].

Only a few systems consider “part of” relations when employing structural matching algorithms: AgreementMaker [19], BLOOMS [20], PhenomeNET [21] and SAMBO [22] consider it explicitly, whereas AML considers all relations.

The relevance of alignment coherence, which is to say, an alignment that when used to integrate the input ontologies does not lead to logical unsatisfiabilities, has been gaining traction within the ontology matching community. The number of matching systems that either implement or reuse an alignment repair algorithm, while still relatively low, has increased in recent years. Unfortunately, the OAEI has not been able to provide a testing ground for alignment repair that highlights the conflict between completeness and logical coherence [23], as manually curated reference alignments are not available for the tasks in which alignment repair is a meaningful problem. Until such a test is created, the evaluation of alignment repair algorithms will remain superficial, and their use will remain almost exclusively automated.

The use of background knowledge is very common among ontology matching systems if we include as background knowledge the usage of WordNet or the SPECIALIST Lexicon for lexical expansion (i.e., to enrich the input ontologies with synonyms). The use of biomedical ontologies (counting the UMLS Metathesaurus [17]) as background knowledge sources is less common, occurring in only 8 of the surveyed systems. Most systems that use background knowledge employ fixed, manually selected sources, with only AML, GOMMA [9] and the LogMapBio variant [24] implementing an automatic selection algorithm. The latter deserves particular mention in that it makes use of BioPortal’s [1] search engine and thus has access to virtually any biomedical ontology as background knowledge. The majority of the systems that use background knowledge make use of it for lexical expansion, as that is the main usage of WordNet. Of the systems that employ biomedical ontologies as background knowledge, most use these ontologies as mediators, by mapping each input ontology to the background knowledge ontology, and then intersecting the two background alignments to generate an alignment between the input ontologies.

Methods

AML overview

AML is an ontology matching system originally developed to tackle the challenges of matching large biomedical ontologies [11], as its namesake and predecessor AgreementMaker [19] was not designed to handle ontologies of this size. While AML’s scope has since expanded, biomedical ontologies have remained one of the main drivers behind its continued development.

AML’s ontology matching pipeline is divided into three phases: ontology loading, matching, and filtering. The pipeline is illustrated in Fig. 1.

In the ontology loading phase, the input ontologies are loaded using the OWL API [25], then parsed into AML’s data structures [11]. The most important of these are the Lexicon, which stores all the lexical information of an ontology in normalized form, and the RelationshipMap, which stores the structural information.

In the matching phase, AML’s various matching algorithms (or matchers) are executed and combined. These include [11, 12, 26]:

The LexicalMatcher, which finds literal full-name matches between the Lexicon entries of two ontologies.
The WordMatcher, which finds matches between entities by computing the word overlap between their Lexicon entries.
The StringMatcher, which finds matches between entities by computing the string similarity between their Lexicon entries using the ISub metric [27].
The ThesaurusMatcher, which finds literal full-name matches involving synonyms inferred from an automatically generated thesaurus, as we will detail in the next subsection.
The MediatingMatcher, which employs the LexicalMatcher to align each of the input ontologies to a third background ontology, and then intersects those alignments to derive an alignment between the input ontologies.
The XRefMatcher, which is analogous to the MediatingMatcher, but relies primarily on OBO [10] cross-references between the background ontology and the input ontologies.
The LogicalDefMatcher, which matches classes that have equal or corresponding OBO [10] logical definitions, as we will detail in a subsequent subsection.

In the filtering phase, AML applies algorithms that remove problem-causing mapping candidates from the preliminary alignment to generate the final alignment. The problems that are addressed include cardinality conflicts (i.e., cases where a class of one ontology is mapped to more than one class of the other ontology) and logical conflicts (i.e., cases where two or more mappings cause the input ontologies to become unsatisfiable when merged via those mappings).

Cardinality conflicts are resolved using the heuristic Selector algorithm, which selects mappings in descending order of similarity score in one of its three modes: ‘strict’, in which all cardinality conflicts are resolved; ‘permissive’, which accepts cardinality conflicts in the case of similarity score ties; and ‘hybrid’, which accepts conflicting pairs of mappings with high similarity score (above 0.75) and otherwise behaves as the ‘permissive’ mode [11]. Logical conflicts are resolved by the Repairer algorithm [23].
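
The greedy selection described above can be illustrated with a minimal Python sketch. It is not AML's implementation: the function name `select` and the input format (a list of `(source, target, score)` tuples) are hypothetical, and only the ‘strict’ and ‘permissive’ modes are covered, since the text does not fully specify how the ‘hybrid’ mode counts conflicting pairs.

```python
def select(candidates, mode="strict"):
    """Greedy cardinality filter over (source, target, score) candidates.

    Hypothetical sketch: 'strict' rejects every mapping that shares an entity
    with an already selected mapping; 'permissive' additionally keeps a
    conflicting mapping when its score ties with the selected one.
    """
    best_src, best_tgt = {}, {}  # score of the first mapping kept per entity
    selected = []
    # Visit candidates in descending order of similarity (stable for ties).
    for s, t, score in sorted(candidates, key=lambda m: m[2], reverse=True):
        conflicts = [d[k] for d, k in ((best_src, s), (best_tgt, t)) if k in d]
        tie = mode == "permissive" and all(c == score for c in conflicts)
        if not conflicts or tie:
            selected.append((s, t, score))
            best_src.setdefault(s, score)
            best_tgt.setdefault(t, score)
    return selected


candidates = [("a", "x", 0.9), ("a", "y", 0.9), ("b", "x", 0.8), ("c", "z", 0.7)]
strict = select(candidates, "strict")
permissive = select(candidates, "permissive")
```

In this toy input, the tie between the two mappings of class `a` is discarded in strict mode and kept in permissive mode, while the lower-scoring mapping that reuses `x` is rejected in both.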

Handling large ontologies

There are three key strategies implemented by AML and other efficient ontology matching systems to match large ontologies: hash-based searching, parallelization, and search space reduction. Additionally, large ontologies also pose problems with respect to the memory requirements of the similarity matrix.

Hash-based searching

The hash-based searching strategy is the most critical strategy for scalability, as it effectively reduces the time complexity of the matching problem from quadratic to linear. This strategy relies on using data structures based on HashMaps, with inverted indices, to store the lexical information of the ontologies. By inverted indices, we mean that rather than having the class ids as keys, their lexical attributes (e.g., the various labels and synonyms, or the words these contain) are used as keys and the values are the sets of ids of the classes that have each attribute. This enables matching systems to simply check whether each lexical attribute of one ontology occurs in the other, rather than making pairwise comparisons of the classes of the two ontologies. Since the attributes are HashMap keys, and HashMap access normally has O(1) time complexity, the hash-based searching strategy has O(n) complexity overall, where n is the number of lexical attributes in the ontology with the fewest attributes. By contrast, the traditional pairwise matching strategy has O(mn) complexity, where m and n are the numbers of lexical attributes in the two ontologies to match.
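
As an illustration of the inverted-index idea, the following Python sketch uses a dictionary in place of a Java HashMap. The function names and the toy ontologies are hypothetical and are not taken from AML; the sketch only shows how iterating over the smaller index and looking each attribute up in the other avoids pairwise class comparisons.

```python
from collections import defaultdict


def build_inverted_index(ontology):
    """Map each lexical attribute to the set of ids of the classes that have it."""
    index = defaultdict(set)
    for class_id, names in ontology.items():
        for name in names:
            index[name].add(class_id)
    return index


def hash_match(source, target):
    """Return the (source_id, target_id) pairs that share an equal attribute."""
    src_index = build_inverted_index(source)
    tgt_index = build_inverted_index(target)
    # Iterate over the index with the fewest attributes; each membership test
    # and lookup on the other index takes constant time on average.
    smaller = src_index if len(src_index) <= len(tgt_index) else tgt_index
    mappings = set()
    for name in smaller:
        if name in src_index and name in tgt_index:
            for s in src_index[name]:
                for t in tgt_index[name]:
                    mappings.add((s, t))
    return mappings


source = {"s1": ["heart", "cardium"], "s2": ["lung"]}
target = {"t1": ["heart"], "t2": ["liver"]}
print(hash_match(source, target))  # {('s1', 't1')}
```

The same structure works for word-based matching if the keys are individual words rather than full names.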

The one limitation of hash-based searching is that it is usually restricted to finding equal attributes—at least when default Java String hash keys are used, as is the case in AML. Thus, it can be employed for literal full-name matches (LexicalMatcher), for matches based on overlapping words (WordMatcher), or even overlapping n-grams (not implemented in AML), but not for traditional string similarity comparisons (StringMatcher). Moreover, the effectiveness of the hash-based searching strategy hinges heavily on normalizing the lexical attributes a priori, in order to maximize the number of equal entries found.

In the case of AML, lexical attributes are normalized upon entry in the Lexicon, during the ontology loading stage. This normalization consists in removing all non-word non-digit characters (except parentheses and dash), inserting white spaces where capitalization is found within words (e.g., “hasPart” becomes “has Part”), and finally converting all characters to lower case. However, because biomedical ontologies may include special formulas (chemical or otherwise), AML uses patterns to detect whether a lexical attribute is a normal word-based name or a formula. In the latter case, the only normalization done is the replacement of underscores with white spaces.
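
A rough Python sketch of the word-based normalization steps is given below. It is an approximation under stated assumptions: the function name `normalize_name` is hypothetical, removed characters (including underscores) are replaced with a space rather than deleted, only ASCII letters are treated as word characters, and the formula-detection branch is omitted because the patterns AML uses are not given in the text.

```python
import re


def normalize_name(name):
    """Approximate the normalization of a word-based name (not a formula)."""
    # Insert a space where a lower-case letter is followed by an upper-case one.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Replace everything except letters, digits, spaces, parentheses and dashes.
    cleaned = re.sub(r"[^A-Za-z0-9()\- ]", " ", spaced)
    # Collapse repeated white space and convert to lower case.
    return re.sub(r"\s+", " ", cleaned).strip().lower()


print(normalize_name("hasPart"))  # has part
```

Normalizing both ontologies with the same function before indexing is what lets an equality-based lookup find names that differ only in case or punctuation.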

Parallelization

Parallelization is a common strategy for improving computational efficiency that exploits the multi-core architecture of modern CPUs. In the context of ontology matching, it typically consists in distributing the computational load across the available cores by either running different (matching) algorithms in parallel or dividing an algorithm into a set of tasks and running those in parallel. While parallelization does not affect the computational complexity of the underlying algorithms, it can reduce their execution time by a factor of up to N, where N is the number of available CPU cores.

AML’s StringMatcher and Repairer algorithms are both implemented for parallelization via subdivision into parallel tasks, given that they are the two main bottlenecks in AML’s matching pipeline. AML’s remaining matching and filtering algorithms are not implemented for parallelization because they have linear complexity and run in at most a few seconds for even the largest ontologies, so the gain in parallelizing them would be negligible to AML’s total run time.

Search space reduction

Under search space reduction, we include the two families of strategies that aim to reduce the search space of the ontology matching problem—partitioning and pruning—as well as the strategy that aims to reduce the scale of the alignment repair problem—modularization.

Partitioning or blocking consists in dividing the ontologies into (usually vertical) partitions or blocks in order to transform a single large matching problem into several smaller ones [28]. Its simplest application is to reduce the memory requirements of the matching task, as is the case in AML’s WordMatcher algorithm. However, it can also be used to reduce the search space of the matching problem by determining which blocks have a significant overlap (typically using a hash-based searching strategy) and attempting to match only those [29]. In this application, it can improve not only the efficiency but also the effectiveness of the matching process, by excluding false positives.

Pruning encompasses any strategy that dynamically avoids comparing parts of the ontologies without partitioning them beforehand [28]. The most common of these strategies is precisely hash-based searching, as it effectively only makes comparisons between entities that have equal HashMap indices (be they names, words, or n-grams). In addition to this form of pruning, AML employs another form called local matching when applying traditional pairwise matching algorithms (such as the StringMatcher) to large ontologies. This strategy consists of matching entities only in the neighborhood of mapped entities found using more efficient (and reliable) hash-based search algorithms. Like blocking, it not only improves computational efficiency but can also help filter false positives.
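
The local matching idea can be sketched as a candidate generator that only proposes pairs located next to already mapped pairs. This is a hypothetical illustration: the function name, the adjacency dictionaries, and the choice of direct neighbors as the "neighborhood" are assumptions, and the expensive similarity computation that would then run on the candidates is left out.

```python
def local_candidates(anchors, src_neighbors, tgt_neighbors):
    """Return the unmapped pairs formed by neighbors of already mapped pairs."""
    candidates = set()
    for s, t in anchors:
        for s_n in src_neighbors.get(s, ()):
            for t_n in tgt_neighbors.get(t, ()):
                if (s_n, t_n) not in anchors:
                    candidates.add((s_n, t_n))
    return candidates


# Anchor found by a hash-based matcher; neighbors taken from each ontology.
anchors = {("s_heart", "t_heart")}
src_neighbors = {"s_heart": {"s_atrium", "s_ventricle"}}
tgt_neighbors = {"t_heart": {"t_atrium"}}
pairs = local_candidates(anchors, src_neighbors, tgt_neighbors)
```

Only the pairs in `pairs` would be passed to a pairwise string similarity measure, instead of the full cross product of the two ontologies.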

Modularization consists of identifying the classes that are semantically relevant for determining whether an alignment is coherent in order to reduce the search space of the repair problem. It is akin to partitioning, but is carried out after the matching stage, and contemplates both the input ontologies and the alignment between them. To enable modularization and reduce the complexity of the repair problem, repair algorithms tend to consider simplifications of the Description Logic of OWL—for instance, the repair algorithms of both AML and LogMap are based on propositional logic [13, 23]. AML’s modularization reduces the search space of the repair problem both with regard to the classes that must be tested for satisfiability (since most tests are logically redundant) and with regard to the classes that must be searched (only those with multiple parents, or involved in mappings or logical restrictions) [23].

Similarity matrices

Another consideration that is critical for matching large ontologies is that the memory requirement of a similarity matrix between two ontologies scales quadratically with their size. For example, for the FMA-SNOMED whole task of the OAEI large biomedical ontologies track, the similarity matrix would require an unwieldy 72 GB of RAM if similarity scores were stored with 8-byte precision. The strategy that AML and other efficient matching systems employ to circumvent this problem is to store a sparse matrix with only the meaningful similarity scores (i.e., those above a certain threshold, such as 0.5). In the case of AML, this matrix is stored in the form of both a list of mapping candidates, to enable sorting and selection, and a HashMap-based table, to enable efficient searching. Each of AML’s matchers produces one such sparse matrix, or preliminary alignment, which can be combined with others either by simple union (keeping the highest score for the same mapping) or hierarchically (by adding only mappings from a less precise matcher that do not conflict with those of more precise matchers).
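
The following Python sketch illustrates the dense memory estimate, the thresholded sparse representation, and the two combination modes. All names are hypothetical; the class counts in the example are round illustrative numbers rather than the actual sizes of FMA and SNOMED, the threshold is treated as inclusive, and "conflict" in the hierarchical mode is assumed to mean sharing a source or target class with a more precise mapping.

```python
def dense_matrix_bytes(n_source, n_target, bytes_per_score=8):
    """Memory needed to store one score for every pair of classes."""
    return n_source * n_target * bytes_per_score


def sparse_alignment(scored_pairs, threshold=0.5):
    """Keep only the pairs whose score reaches the threshold."""
    return {(s, t): sim for s, t, sim in scored_pairs if sim >= threshold}


def union(first, second):
    """Combine two alignments, keeping the highest score for the same mapping."""
    merged = dict(first)
    for pair, sim in second.items():
        merged[pair] = max(sim, merged.get(pair, 0.0))
    return merged


def hierarchical(precise, less_precise):
    """Add less precise mappings only when both of their classes are unmapped."""
    used_src = {s for s, _ in precise}
    used_tgt = {t for _, t in precise}
    merged = dict(precise)
    for (s, t), sim in less_precise.items():
        if s not in used_src and t not in used_tgt:
            merged[(s, t)] = sim
    return merged


# Illustrative sizes: 100,000 x 90,000 classes at 8 bytes per score is 72 GB.
print(dense_matrix_bytes(100_000, 90_000))  # 72000000000

lexical = sparse_alignment([("a", "x", 0.95), ("b", "y", 0.3)])
word = sparse_alignment([("a", "x", 0.7), ("a", "z", 0.6), ("c", "w", 0.8)])
```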

Handling the rich vocabulary of biomedical ontologies

Processing lexical annotations

AML, like most ontology matching systems that perform well in the biomedical domain, takes into account a wide range of lexical annotations from biomedical ontologies. Namely, AML stores in the Lexicon the local names (when not alphanumeric codes), labels, and all annotations with properties corresponding to labels or synonyms (e.g., “prefLabel”, “hasExactSynonym”, “FULLSYN”). The various annotations are condensed into four lexical categories: ‘localName’, ‘label’, ‘exactSynonym’, and ‘otherSynonym’. While this mapping is automatic, it covers the large majority of the annotation properties presently in use in biomedical ontologies and thesauri.

One strategy that, to the best of our knowledge, solely AML employs is that it assigns different numeric weights to each of its lexical categories, and uses these weights to score each mapping of lexical origin. The weighting scheme employed by AML is fixed, meaning that each lexical category is given a predetermined weight that reflects its expected reliability. This approach helps improve the effectiveness of AML’s Selector as it leads to fewer similarity ties and to mappings based on more reliable annotations being scored higher than those based on less reliable ones.
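
To make the effect of category weights concrete, here is a small Python sketch built around the “gingiva”/“gum” example from the Background. The weight values, the product used to score a match, the function names, and the id `SNOMED_gingiva` are all hypothetical; AML's actual weights and scoring formula are not given in the text.

```python
# Hypothetical weights; the values AML assigns are not stated in the text.
WEIGHTS = {"localName": 1.0, "label": 1.0, "exactSynonym": 0.9, "otherSynonym": 0.8}


def build_lexicon(ontology):
    """Turn {class_id: [(name, category)]} into {name: {class_id: weight}}."""
    lexicon = {}
    for class_id, entries in ontology.items():
        for name, category in entries:
            weights = lexicon.setdefault(name, {})
            weights[class_id] = max(WEIGHTS[category], weights.get(class_id, 0.0))
    return lexicon


def weighted_match(source, target):
    """Score each pair sharing a name by the product of the two entry weights."""
    src, tgt = build_lexicon(source), build_lexicon(target)
    scores = {}
    for name in src.keys() & tgt.keys():
        for s, ws in src[name].items():
            for t, wt in tgt[name].items():
                scores[(s, t)] = max(ws * wt, scores.get((s, t), 0.0))
    return scores


fma = {"FMA_59762": [("gingiva", "label"), ("gum", "exactSynonym")]}
snomed = {
    "SNOMED_gingiva": [("gingiva", "label"), ("gum", "exactSynonym")],
    "SNOMED_426210003": [("gum", "label")],
}
scores = weighted_match(fma, snomed)
```

Under these assumed weights the label-to-label match scores 1.0 and the synonym-to-label match scores 0.9, so a selector working in descending score order would prefer the correct mapping.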

Inferring new synonyms

AML employs several strategies for automatically generating new synonyms, with the goal of improving the coverage and effectiveness of its hash-based searching algorithms. Having more synonyms increases the likelihood that corresponding concepts are described using equal lexical entries, and thus will tend to increase recall, but may also decrease precision.

One strategy AML employs is to automatically generate synonyms for classes by removing stop words from their names, using a predefined stop word list, as well as by removing name portions within parentheses. For example, for the SNOMED lexical entry “structure of nervous system”, AML generates the synonym “nervous system” by removing the leading stop words “structure” and “of”, and adds this synonym to the Lexicon, assigning it to all classes to which the original entry was assigned. Analogously, for the NCI lexical entry “mixed mesodermal (mullerian) tumor”, AML generates the synonym “mixed mesodermal tumor” by removing the section within parentheses.
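
Both transformations can be sketched in a few lines of Python. The stop word list below is a hypothetical stand-in for AML's predefined list, the function names are illustrative, and this sketch removes stop words wherever they occur in the name, which may differ from AML's behavior.

```python
import re

# Hypothetical stop word list; AML's predefined list is not given in the text.
STOP_WORDS = {"structure", "of", "the"}


def strip_parentheses(name):
    """Remove parenthesized portions and collapse the remaining white space."""
    return re.sub(r"\s+", " ", re.sub(r"\([^)]*\)", " ", name)).strip()


def strip_stop_words(name):
    """Remove every word that appears in the stop word list."""
    return " ".join(word for word in name.split() if word not in STOP_WORDS)


def generate_synonyms(name):
    """Return the new names produced by either transformation."""
    variants = {strip_parentheses(name), strip_stop_words(name)}
    variants.discard(name)
    variants.discard("")
    return variants


print(generate_synonyms("structure of nervous system"))  # {'nervous system'}
```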

Another strategy AML employs for synonym generation consists in generating a thesaurus by comparing the various annotations of each class, and then using this thesaurus to generate new synonyms [12, 18]. For example, given a lexical analysis of the annotations ‘stomach serosa’ and ‘gastric serosa’ for Mouse Gross Anatomy Ontology (MA) class MA_0001626, AML would add to its thesaurus that ‘stomach’ and ‘gastric’ are synonymous words. It would then use this information to generate new synonyms for lexical entries containing either of the words by replacing it with the other. In order to contain the loss in precision that this strategy tends to generate, AML employs it in a dedicated matching algorithm, the ThesaurusMatcher, which finds only exact matches involving synonyms generated by the thesaurus.
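
One simple way to approximate this kind of thesaurus inference is to look for annotations of the same class that differ in exactly one word. The Python sketch below does this; it is an assumption about how such a lexical analysis could work, not a description of AML's algorithm, and the function names are hypothetical.

```python
def infer_word_synonyms(names):
    """Find word pairs from names of one class that differ in a single word."""
    pairs = set()
    tokenized = [name.split() for name in names]
    for i, first in enumerate(tokenized):
        for second in tokenized[i + 1:]:
            if len(first) != len(second):
                continue
            diff = [(x, y) for x, y in zip(first, second) if x != y]
            if len(diff) == 1:
                pairs.add(frozenset(diff[0]))
    return pairs


def expand(name, pairs):
    """Generate new names by swapping each thesaurus word for its synonym."""
    words = name.split()
    variants = set()
    for pair in pairs:
        w1, w2 = tuple(pair)
        for old, new in ((w1, w2), (w2, w1)):
            if old in words:
                variants.add(" ".join(new if w == old else w for w in words))
    return variants


thesaurus = infer_word_synonyms(["stomach serosa", "gastric serosa"])
print(expand("gastric mucosa", thesaurus))  # {'stomach mucosa'}
```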

Finally, AML can also use background knowledge sources to generate synonyms, but this strategy is detailed in the next subsection.

Exploiting background knowledge

Background knowledge selection

The problem of automatically identifying relevant sources of background knowledge has been the subject of several studies [8, 9]. Most rely on analyzing the background knowledge sources to determine their overlap with the input ontologies, yet overlap does not imply usefulness. A background knowledge source is only useful if it contains (lexical or structural) knowledge not contained in the input ontologies and which is relevant to match them, or in other words, if we can find new mappings by using it (assuming it is reliable, and thus the mappings will mostly be correct). Given that, when employing a hash-based search algorithm, the difference in cost between computing a background knowledge alignment and computing an overlap is negligible, we might as well do the former and obtain a more direct measure of usefulness.

These are the foundations of AML’s algorithm for automatic selection of background knowledge sources [8]. This algorithm employs the concept of mapping gain, defined as the relative number of new mappings that an alignment would add to another alignment, as a measure of usefulness. In a first stage, it uses the mapping gain over the baseline LexicalMatcher alignment to measure the individual usefulness of each candidate background knowledge source, and preselect them. In a second stage, it iterates through the preselected sources in descending order of individual mapping gain, recomputes the mapping gain over the current baseline alignment, and if significant, adds that background knowledge alignment to the baseline. Thus, it can not only identify the most promising individual background knowledge source, but also select a near-optimal combination of multiple background knowledge sources.
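
The two-stage selection can be sketched as follows. This is a simplified illustration: alignments are plain sets of mapping pairs, the gain is assumed to be the number of new mappings divided by the size of the baseline, and the significance threshold of 0.05 is a hypothetical value, not one reported for AML.

```python
def mapping_gain(baseline, candidate):
    """Number of mappings new to the baseline, relative to the baseline size."""
    if not baseline:
        return float(bool(candidate))
    return len(candidate - baseline) / len(baseline)


def select_sources(baseline, source_alignments, threshold=0.05):
    """Preselect sources by individual gain, then add them greedily."""
    gains = {n: mapping_gain(baseline, a) for n, a in source_alignments.items()}
    preselected = sorted(
        (n for n, g in gains.items() if g >= threshold),
        key=lambda n: gains[n],
        reverse=True,
    )
    current, chosen = set(baseline), []
    for name in preselected:
        # Recompute the gain over the alignment accumulated so far.
        if mapping_gain(current, source_alignments[name]) >= threshold:
            current |= source_alignments[name]
            chosen.append(name)
    return chosen, current


baseline = {(i, i) for i in range(100)}
sources = {
    "A": {(i, i) for i in range(100, 110)},  # 10 new mappings
    "B": {(i, i) for i in range(100, 108)},  # 8 new mappings, all also in A
    "C": {(i, i) for i in range(50)},        # nothing new
}
chosen, final = select_sources(baseline, sources)
```

Source `B` passes the first stage but is dropped in the second, because everything it contributes is already covered once `A` has been added.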

Information sources

Like most matching systems, AML relies primarily on the lexical information of background knowledge ontologies (MediatingMatcher). However, when OBO cross-references are available, it can use them instead of or in addition to the lexical information via its XRefMatcher [26]. Cross-references are essentially manually-curated mappings between an OBO ontology and others, listed in the ontology itself. For example, the UBERON class UBERON_0001275 (“pubis”) includes cross-references (via annotation property “hasDbXRef”) to FMA class 16595 (“pubis”) and NCI class C33423 (“pubic bone”). AML’s XRefMatcher employs these cross-references instead of performing lexical matches between the input ontologies and the background knowledge ontology, then like the MediatingMatcher, intersects the background knowledge alignments to derive an alignment between the two input ontologies. In the example above, if we were matching FMA to NCI using UBERON as a background knowledge source, it would map the FMA class to the NCI class because they are referenced by the same UBERON class.
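
The pubis example can be reproduced with a short Python sketch that derives mappings between two input ontologies from the cross-references of a background ontology. The function name and the `PREFIX:id` string format of the cross-references are assumptions made for illustration; real ontologies may encode them differently.

```python
def mediate_by_xrefs(xrefs, source_prefix, target_prefix):
    """Map source and target classes referenced by the same background class.

    xrefs: {background_class_id: [cross-reference strings such as 'FMA:16595']}
    """
    mappings = set()
    for refs in xrefs.values():
        sources = [r for r in refs if r.startswith(source_prefix + ":")]
        targets = [r for r in refs if r.startswith(target_prefix + ":")]
        mappings.update((s, t) for s in sources for t in targets)
    return mappings


uberon_xrefs = {"UBERON_0001275": ["FMA:16595", "NCI:C33423"]}
print(mediate_by_xrefs(uberon_xrefs, "FMA", "NCI"))  # {('FMA:16595', 'NCI:C33423')}
```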

Cross-references do not necessarily correspond to equivalence relations; all that is implied is a close semantic overlap. However, the same could also be said of ontology mappings: even if formally equivalence is always implied, the strictness with which it is meant varies from mapping to mapping. In practice, we found cross-references to be more reliable than literal lexical matches for inferring background knowledge mappings. For this reason, AML’s XRefMatcher supersedes its MediatingMatcher, as it uses cross-references when these are available, but complements them with lexical matches when the latter provide at least twice the coverage of the input ontology. Thus, it accommodates cases such as cross-references only being available for one of the input ontologies, as well as being available for both but only covering part of them.

Background knowledge usage

In addition to the traditional use of background knowledge ontologies as mediators, AML can also use them for lexical expansion, i.e., to generate new synonyms in the input ontologies. This strategy consists in adding, for each class of each of the ontologies to match that has a correspondence to a class of the background knowledge ontology, all the lexical entries of the latter as new synonyms. These correspondences must first be established by mapping the input ontologies to the background knowledge ontology, via either the MediatingMatcher or the XRefMatcher.
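As a minimal sketch of lexical expansion, assume each lexicon is a dictionary from class identifier to a set of names, and that the correspondences to the background knowledge ontology have already been computed. The function name and data layout are hypothetical and do not reflect AML’s internal Lexicon.

```python
from typing import Dict, Set


def expand_lexicon(
    input_lexicon: Dict[str, Set[str]],
    background_lexicon: Dict[str, Set[str]],
    correspondences: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Return a copy of `input_lexicon` in which every class mapped to a
    background class also carries the names of that background class."""
    expanded = {cls: set(names) for cls, names in input_lexicon.items()}
    for cls, background_cls in correspondences.items():
        if cls in expanded:
            expanded[cls] |= background_lexicon.get(background_cls, set())
    return expanded


# Names taken from the UBERON example given earlier in the text.
nci = {"NCI:C33423": {"pubic bone"}}
uberon = {"UBERON_0001275": {"pubis"}}
expanded_nci = expand_lexicon(nci, uberon, {"NCI:C33423": "UBERON_0001275"})
```

After expansion, the NCI class carries the name “pubis” and can be matched directly against the other input ontology in a single matching run.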

Given that the problem of handling large ontologies is compounded when using background knowledge ontologies, as not one but three matching tasks are required, the lexical expansion strategy enables AML to harness the knowledge contained in background knowledge ontologies more efficiently. It is equivalent to the use of background knowledge ontologies as mediators with regard to finding full-name matches, but it allows for partial matches to be indirectly derived from the background knowledge ontology with a single use (rather than three) of either the WordMatcher or the StringMatcher. However, deriving indirect partial matches can lead to a significant decrease in precision, meaning that this strategy can be less reliable than the mediating strategy.

Using logical definitions

AML has recently begun exploring the use of the logical definitions encoded in OBO Foundry ontologies [10] for ontology matching [12]. Logical definitions (or cross-products) correspond to composite mappings, where a class of one ontology is declared as equivalent to the intersection of two or more other classes of different ontologies. For example, the Human Phenotype Ontology (HP) [30] class HP_0000892 (“bifid ribs”) corresponds to Phenotypic Quality Ontology [31] class PATO_0000403 (“cleft”) inhering in the UBERON class UBERON_0002228 (“rib”) with modifier PATO_0000460 (“abnormal”), as depicted in Fig. 2. They are not strictly background knowledge in the sense that they are included in the ontologies themselves, but they do correspond to mappings to external ontologies. AML’s LogicalDefMatcher maps classes that have identical logical definitions. Continuing from the previous example, it would detect that Mammalian Phenotype Ontology (MP) [32] class MP_0000153 (“rib bifurcation”) has the exact same logical definition as HP_0000892 and thus map the two classes, as shown in Fig. 2. This is an example of a mapping that could not be found through lexical or structural matching approaches, but which logical definitions enable us to find.
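Matching on identical logical definitions lends itself to a hash-based lookup, sketched below with the HP and MP example from the text. Each definition is modelled as a frozen set of (relation, class) pairs; this representation and the relation names are assumptions made for the example, not AML’s data model.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set, Tuple

LogicalDef = FrozenSet[Tuple[str, str]]


def match_logical_definitions(
    source_defs: Dict[str, LogicalDef],
    target_defs: Dict[str, LogicalDef],
) -> Set[Tuple[str, str]]:
    """Map classes whose logical definitions are identical."""
    index = defaultdict(list)
    for cls, definition in source_defs.items():
        index[definition].append(cls)
    return {
        (source, target)
        for target, definition in target_defs.items()
        for source in index.get(definition, [])
    }


bifid_rib = frozenset({
    ("quality", "PATO_0000403"),       # cleft
    ("inheres_in", "UBERON_0002228"),  # rib
    ("modifier", "PATO_0000460"),      # abnormal
})
hp_defs = {"HP_0000892": bifid_rib}
mp_defs = {"MP_0000153": bifid_rib}
hp_mp_mappings = match_logical_definitions(hp_defs, mp_defs)
```

Because frozen sets are hashable and order-independent, two definitions listing the same components in a different order still compare as equal.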

Evaluation

Datasets

The datasets used in this study were the OAEI 2016 datasets from the Anatomy, Large Biomedical Ontologies, and Disease and Phenotype tracks [14]:

The Anatomy track consists of matching the Mouse Gross Anatomy Ontology [33] with the portion of the NCI Thesaurus [4] describing the human anatomy. It is evaluated using a manually curated reference alignment.
The Large Biomedical Ontologies track features six matching tasks that consist in the pairwise matching of FMA [3], NCI [4], and SNOMED [5] in two modalities: small overlapping fragments, and whole ontologies. The evaluation is based on reference alignments derived automatically from the UMLS Metathesaurus [17].
The Disease and Phenotype track includes two tasks, one consisting of mapping the Human Disease Ontology (DOID) [34] to the Orphanet and Rare Diseases Ontology (ORDO), and another consisting of mapping the Human Phenotype Ontology (HP) [30] to the Mammalian Phenotype Ontology (MP) [32]. The evaluation carried out in the OAEI 2016 was primarily based on consensus alignments that include all mappings found by either 2 or 3 participating matching systems.

Settings

To evaluate the impact of the various challenges of matching biomedical ontologies and the strategies for tackling them, we conducted a number of tests, which are further detailed in the “Results” section.

All tests were carried out on a personal computer with an Intel i5-4570 CPU @ 3.20GHz, with 10GB RAM allocated to Java, and the Windows 7 64-bit operating system. Except where otherwise noted, the StringMatcher was run concurrently on 4 CPU threads, and all other matching algorithms were run using a single CPU thread.

When AML’s complete matching pipeline is mentioned, it refers to the matching pipeline employed for the OAEI 2016 [12]. The sources of background knowledge available to AML were also the same as it used in the OAEI 2016: the Uber Anatomy Ontology (UBERON) [2], the Human Disease Ontology (DOID) [34], and the Medical Subject Headings (MeSH) [35].

Tests where only the run time was being assessed were carried out on all datasets. Tests where the F-measure was being assessed were carried out on only the Anatomy and Large Biomedical Ontologies datasets (except where otherwise noted) since a consensus alignment, as used in the evaluation of the Disease and Phenotype track, was deemed insufficiently accurate for the purpose of this study.

In the final test of this study, we performed a manual evaluation of the mappings found uniquely through logical definitions from the HP-MP task (as logical definitions are only available for the ontologies in this task). These mappings were produced with older versions of the logical definitions of the HP ontology, which mapped to the FMA rather than to UBERON. Thus, to derive HP-MP mappings based on logical definitions, the cross-references between UBERON and FMA were used to provide correspondences between the logical definitions, when the definitions were otherwise identical.

Results

Efficiency tests

Hash-based searching versus pairwise comparisons

In order to compare the efficiency of hash-based searching with traditional pairwise comparison algorithms, we implemented a functional equivalent of AML’s LexicalMatcher that makes pairwise equality comparisons instead of hash-based searches. We compared the run time of this QuadraticLexicalMatcher (running concurrently on 4 CPU threads) with that of the LexicalMatcher. Furthermore, we performed a power law regression of the run times of the two approaches as a function of the number of lexical entries in the matching task.
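The difference between the two search strategies can be illustrated with the following sketch, which is not the code of either matcher. Lexicons are modelled as dictionaries from class identifier to a set of names; the function names and this layout are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Lexicon = Dict[str, Set[str]]


def hash_match(source: Lexicon, target: Lexicon) -> Set[Tuple[str, str]]:
    """Index the target names once, then look up each source name.
    The cost grows with the total number of lexical entries."""
    index = defaultdict(set)
    for cls, names in target.items():
        for name in names:
            index[name].add(cls)
    return {
        (s, t)
        for s, names in source.items()
        for name in names
        for t in index.get(name, ())
    }


def pairwise_match(source: Lexicon, target: Lexicon) -> Set[Tuple[str, str]]:
    """Compare every source name with every target name.
    The cost grows with the product of the two lexicon sizes."""
    return {
        (s, t)
        for s, source_names in source.items()
        for t, target_names in target.items()
        for a in source_names
        for b in target_names
        if a == b
    }
```

Both functions return the same alignment; only the number of comparisons needed to obtain it differs.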

The results of this comparison are shown graphically in Fig. 3, and detailed in Table 2. The difference in scale between the run times of the two approaches is readily apparent, as the LexicalMatcher runs in under a second for even the largest tasks whereas the QuadraticLexicalMatcher ranges from 9 seconds for the Anatomy task to over 4 hours for the three whole ontologies tasks of the Large Biomedical Ontologies track. The power law regressions reveal that the LexicalMatcher has a sub-linear behavior (exponent 0.75) as a function of the number of lexical entries, whereas the QuadraticLexicalMatcher has a near-quadratic behavior (exponent 1.8).

Table 2 Run time comparison between the hash-based LexicalMatcher (in milliseconds) and its functional equivalent QuadraticLexicalMatcher that performs pairwise comparisons (in seconds), on all biomedical OAEI 2016 tasks

Local versus global string matching

The application of traditional string matching algorithms, such as ISub [27], requires pairwise comparisons and thus is not scalable. Consequently, many matching systems forgo their use, instead opting for approximations based on hash searches (such as n-gram overlap). AML is able to make use of string matching algorithms by employing them locally, in the vicinity of mappings derived through hash-based searching. The expectation is that this local matching strategy scales approximately linearly with the size of the ontologies (in number of classes rather than lexical entries, as it is at its core a structural algorithm). To assess whether that is the case, we measured the run time of AML’s StringMatcher when used (locally) in its full matching pipeline and performed a power law regression as a function of the number of classes (of the input ontology with the most classes).
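The idea of restricting string matching to the vicinity of existing mappings can be sketched as follows. Several simplifications are assumed: the vicinity is given as precomputed neighbor lists, each class has a single name, the similarity threshold is arbitrary, and Python’s difflib.SequenceMatcher stands in for ISub, which is not available in the standard library.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple


def local_string_match(
    anchors: Set[Tuple[str, str]],
    source_neighbors: Dict[str, List[str]],
    target_neighbors: Dict[str, List[str]],
    source_names: Dict[str, str],
    target_names: Dict[str, str],
    threshold: float = 0.8,  # hypothetical similarity threshold
) -> Set[Tuple[str, str]]:
    """Compare names only between the neighbors of already mapped classes."""
    found = set()
    for source, target in anchors:
        for s in source_neighbors.get(source, ()):
            for t in target_neighbors.get(target, ()):
                if (s, t) in anchors:
                    continue
                similarity = SequenceMatcher(
                    None, source_names[s], target_names[t]
                ).ratio()
                if similarity >= threshold:
                    found.add((s, t))
    return found
```

The number of comparisons depends on the number of anchor mappings and on the size of their neighborhoods rather than on the product of the ontology sizes.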

The results of this regression, shown in Fig. 4, reveal that the behavior of the local StringMatcher is on average sub-linear (exponent 0.76), and while there is substantial variation from this behavior, it is bounded by O(n log(n)). In more concrete terms, the local StringMatcher runs in 13 seconds in the worst case, whereas the global algorithm has an expected run time of over 8 hours for the three whole ontologies tasks of the Large Biomedical Ontologies track.
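A power law regression of the kind used in the efficiency tests models the run time as t = a·n^b and estimates the exponent b by ordinary least squares on the log-transformed data. The sketch below shows one way to do this; the function name is hypothetical, and it should only be fed measured values such as those reported in the tables, since no data from this study are reproduced here.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(
    sizes: Sequence[float], times: Sequence[float]
) -> Tuple[float, float]:
    """Fit t = a * n**b by least squares on (log n, log t).
    Returns the coefficient a and the exponent b."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    coefficient = math.exp(mean_y - exponent * mean_x)
    return coefficient, exponent
```

An exponent below 1 indicates sub-linear growth, an exponent near 1 linear growth, and an exponent near 2 quadratic growth.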

To assess the effectiveness of applying string matching only locally in comparison with applying it globally, we ran AML’s full matching pipeline replacing the local StringMatcher with a global run of that algorithm, and compared the F-measure of the pipeline with the global and local variants. We did not run this test on the Large Biomedical Ontologies whole ontologies tasks, as the expected run time of the global StringMatcher in these tasks exceeds 8 hours, and the conclusions drawn from the small overlapping fragments tasks can be extrapolated to these tasks. The results of this comparison are shown in Table 3.

Table 3 Evaluation of AML’s full matching pipeline with the StringMatcher run locally and run globally

The results show that, as expected, employing the global StringMatcher has an advantage with respect to recall in most tasks (the exception being Anatomy). The counterpoint is that precision is significantly lower than when the local variant is employed, to the effect that the F-measure is also lower in most tasks (except for FMA-SNOMED small).

Lexical richness tests

All lexical annotations versus primary annotation

The number and variety of lexical annotations per class are a feature of biomedical ontologies that should be taken into account when matching them. In order to assess the impact of this lexical richness, we compared the performance of AML’s LexicalMatcher when using all available lexical annotations (as normal) and when using only the primary name of each class. To avoid introducing an external bias, we turned off AML’s automatic generation of synonyms for this test, so that only the lexical richness of the ontologies themselves is considered.

The results of this test, shown in Table 4, are conclusive: considering all lexical annotations leads to a substantial increase in F-measure in all tasks, ranging from 4.7% in the case of Anatomy to 18.3% in the case of the FMA-NCI small task. They demonstrate that taking into account all lexical annotations of biomedical ontologies is clearly necessary to match them effectively.

Table 4 Comparison between the LexicalMatcher using all class names and synonyms, and using only primary names

Weighted versus unweighted lexical annotations

To evaluate the contribution of differentiating between different kinds of lexical annotations, we ran AML’s full matching pipeline with its weighting scheme turned off, and compared the results to those of the normal pipeline. The results of this comparison, shown in Table 5, reveal that the use of Lexicon weights improves the F-measure in all matching tasks except for FMA-SNOMED small where there is a tie. The most extreme case is that of the Anatomy task, where the F-measure increases by 6.3%.

Table 5 Comparison between AML’s full matching pipeline with and without the use of Lexicon weights to score the mappings

The contribution of the ThesaurusMatcher

AML’s ThesaurusMatcher exploits the lexical richness of biomedical ontologies to infer new synonyms through automatic lexical composition analysis, and thereby find new mappings. In order to assess the extent to which new knowledge can be generated by such an approach, and how reliable it is, we compared the performance of LexicalMatcher plus ThesaurusMatcher with the performance of the LexicalMatcher alone. The results of this comparison are presented in Table 6.

Table 6 Comparison between the combination of LexicalMatcher and ThesaurusMatcher, and the LexicalMatcher alone

We can see that the ThesaurusMatcher leads to a consistent increase in recall, but a decrease in precision, in all tasks. For Anatomy and the Large Biomedical Ontologies small tasks, the balance is positive, as the resulting F-measure is greater than without the ThesaurusMatcher. For FMA-NCI whole and FMA-SNOMED whole, it is negative, whereas for SNOMED-NCI whole, it is essentially neutral.

Background knowledge tests

Comparison of information sources and usage strategies

There are two main strategies for using background knowledge ontologies: as mediators, or for lexical expansion. There are also two types of information that can be used to map the background knowledge ontologies to the input ontologies: lexical information, and cross-references. We evaluated AML’s full matching pipeline with the background knowledge matching component modified appropriately to cover all four combinations of these two factors. We carried out this evaluation on the Anatomy and FMA-NCI small tasks, as these are the only tasks in which the coverage of the available cross-references from UBERON is comparable to its lexical coverage, and thus for which comparing the two information sources would be fair. The results of this evaluation are shown in Table 7.

Table 7 Evaluation of AML’s matching pipeline in the Anatomy and FMA-NCI small tasks with different combinations of background knowledge information source (lexical vs. cross-references) and usage strategies (mediator vs. lexical expansion)

The first observation we can make from the results is that cross-references are the best source of information in both tasks, albeit with different usage strategies, and are better than using lexical information regardless of strategy. The lexical expansion strategy produces strictly worse results than the mediator strategy when based on lexical information. When based on cross-references, it produces a higher recall than the mediator strategy, and in the case of the Anatomy task, a higher F-measure as well.

On the use of logical definitions

Another source of information that can be exploited for matching biomedical ontologies is OBO logical definitions. AML and PhenomeNET [21] both explored the use of logical definitions in the OAEI 2016’s HP-MP task from the Disease and Phenotype track.

Because there is no manually validated reference alignment for the HP-MP task, we assessed the contribution of the LogicalDefMatcher by manually evaluating the mappings found by this matcher and not by AML’s pipeline when this matcher is disabled. We classified each mapping as: equivalent, if the two classes were deemed semantically equivalent; overlapping, if the classes were not strictly equivalent (one was slightly broader than the other) but sufficiently similar that a direct mapping between the two would be conceivable depending on the scope of the alignment; or false, if the classes were too dissimilar to be mapped. The full results of this manual evaluation are included in Additional file 1.

Out of the 92 mappings identified only with the LogicalDefMatcher, we found that 49 were equivalent and an additional 21 were overlapping, so that in total 70 mappings were plausibly correct. This gives us a best-case precision of 76.1%, and a worst-case precision of 53.3% (if we consider only the strictly equivalent mappings correct). Given that these 92 mappings represent 5% of the total mappings found by AML, their contribution to AML’s recall should be significant even in the worst case, but in the absence of a reference alignment, we cannot determine it, nor can we ascertain whether the contribution of this matcher is positive with respect to F-measure.
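The two precision bounds follow directly from the counts of the manual evaluation, as the short calculation below shows. The variable names are chosen for illustration only.

```python
# Counts from the manual evaluation of the LogicalDefMatcher mappings.
total_mappings = 92
equivalent = 49
overlapping = 21

# Worst case: only strictly equivalent mappings count as correct.
worst_case_precision = equivalent / total_mappings

# Best case: overlapping mappings are also accepted as correct.
best_case_precision = (equivalent + overlapping) / total_mappings

print(f"worst case: {100 * worst_case_precision:.1f}%")
print(f"best case: {100 * best_case_precision:.1f}%")
```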