OntoCheck: verifying ontology naming conventions and metadata completeness in Protégé 4
© Schober et al.; licensee BioMed Central Ltd. 2012
Published: 21 September 2012
Although policy providers have outlined minimal metadata guidelines and naming conventions, ontologies of today still display inter- and intra-ontology heterogeneities in class labelling schemes and metadata completeness. This fact is at least partially due to missing or inappropriate tools. Software support can ease this situation and contribute to overall ontology consistency and quality by helping to enforce such conventions.
We provide a plugin for the Protégé Ontology editor to allow for easy checks on compliance with ontology naming conventions and metadata completeness, as well as curation of any violations found.
In a requirement analysis, derived from a prior standardization approach carried out within the OBO Foundry, we investigated the capabilities software tools need in order to check, curate and maintain class naming conventions. A Protégé tab plugin was implemented accordingly using the Protégé 4.1 libraries. The plugin was tested on six different ontologies. Based on these test results, the plugin was refined, including by the integration of new functionality.
The new Protégé plugin, OntoCheck, allows for ontology tests to be carried out on OWL ontologies. In particular the OntoCheck plugin helps to clean up an ontology with regard to lexical heterogeneity, i.e. enforcing naming conventions and metadata completeness, meeting most of the requirements outlined for such a tool. Test violations found can be corrected to foster consistency in entity naming and meta-annotation within an artefact. Once specified, check constraints like name patterns can be stored and exchanged for later re-use. Here we describe a first version of the software, illustrate its capabilities and use within running ontology development efforts and briefly outline improvements resulting from its application. Further, we discuss OntoCheck's capabilities in the context of related tools and highlight potential future expansions.
The OntoCheck plugin facilitates labelling error detection and curation, contributing to lexical quality assurance in OWL ontologies. Ultimately, we hope this Protégé extension will ease ontology alignments as well as lexical post-processing of annotated data and hence can increase overall secondary data usage by humans and computers.
With the advent of the semantic web and RDF-based knowledge representation techniques, off-the-shelf ontology editors like Protégé 4 have gained widespread use. Although its functionality is sufficient for daily ontology editing tasks, some pre-release clean-up checks on the ontology, especially in the area of class naming conventions and metadata availability, can usefully complement Protégé 4. It was shown that inconsistencies in naming conventions can impair readability and navigability of ontology class hierarchies, and even hinder their alignment and integration. An initial specification for typographic, syntactic and semantic naming conventions for life science ontologies has been introduced by the OBO Foundry. It was shown that clear naming conventions for editor-preferred class names (e.g. stored in the rdfs:label or rdf:ID/OWLClassName) provide guidance to ontology creators and help developers avoid flaws and lexical inaccuracies when editing, but especially when interlinking ontologies. By increasing the robustness and exportability of ontology class labels, adherence to explicit class naming conventions can foster communication when ontology engineers need to collaborate with external groups to align their ontologies, and can facilitate the import and usage of classes from external ontologies or imported ontology modules. Naming conventions increase the robustness of context-based text mining for automatic term recognition and text annotation, and they ease the manual and automated integration of terminological artifacts, i.e. comparison, orthogonality-checking, alignment and mapping. Robust labeling generally eases the access to ontologies through meta-tools such as those provided by the NCBO BioPortal, i.e. by reducing the diversity with which these tools have to deal, thus reducing the burden on tool and ontology developers alike. Ultimately, following clear labeling guidelines can facilitate ontology re-use and reduce redundant development.
Another area that can profit from tool support is metadata enrichment: Although 'expensive' to add, metadata stored alongside a class in self-defined annotation properties or standardized elements provided by metadata policy providers like Dublin Core will ease the human understanding of the editorial, administrative and semantic nature of ontological entities. Before a new ontology version is released for public use, it should be checked whether all metadata elements that are mandatory within a particular design principle documentation, e.g. annotation properties like natural language definitions or class labels, are present in the ontology, so that the ontology can be assumed to be sufficiently described for the human user.
Based on our own previous experience, we think the actual status of metadata completeness and labelling consistency can be improved, especially where lack of compliance is due to missing software capabilities. This need for tool support is also exemplified by pre-release tests implemented independently within different groups to check on metadata availability and labelling consistency, e.g. as seen in the OBI project and in the Disease Ontology project respectively.
To assist ontology editors in complying with metadata requirements and naming conventions outlined in their style guides and design principle documentations, we here introduce a Protégé plugin that checks an OWL ontology loaded into Protégé against naming conventions and metadata completeness specified by the user. Specifically, our plugin intends to contribute to lexical harmonization by validating class names according to specified checks. We here present the OntoCheck plugin, which intends to ensure naming consistency by testing for defined label patterns and allows for amendments in the area of metadata analysis.
The OntoCheck plugin was implemented as a plugin for the Protégé 4.1 ontology editor using the Protégé OWL API (version 3.2.2) and Java version 1.6.0_22. An informal requirement analysis was conducted on the basis of the OBO Foundry naming conventions and on-going editing work in the different ontology engineering projects the authors were involved in. To test and to quantify OntoCheck's capabilities, as well as gather further requirements, we applied the plugin within different projects and investigated the following six ontologies: Biotop, DCO, NTDO, GoodRelations, Vehicle Sales Ontology, and @neurist ontology. For each, we created, stored and applied a different set of checks.
The OntoCheck plugin is available for download at our website (http://www.imbi.uni-freiburg.de/ontology/OntoCheck/).
Requirements for a naming convention and metadata verification tool | Aspects met and implementation
Easy installation, usage and intuitive navigation. | Protégé plugin, structured into 3 self-explanatory tabs. Tooltips providing on-the-spot guidance.
Generation and display of numeric counts for selectable ontology metrics. | Making use of the Protégé and Java API, diverse metrics are available, amending the already present 'Ontology Metrics'.
Selection of an 'entry class node' from which, leaf-wards, a check should be done. | Allows testing for a certain postfix, e.g. '_Disposition', only within a selected 'Disposition' entry node sub-tree. Allows checking for metadata availability in selectable subtrees.
Display of classes failing a specified test and export as list. | Found classes can be sorted according to different criteria and exported for later curation.
Display of quantitative results on detected issues in terms of absolute and percentage counts in a given subtree. | A statistical data pane verbalizes the numerical results in a copyable natural language sentence.
Storage and reload capabilities for created checks allowing for later re-use and propagation. | An XML file is generated storing all checks in a reproducible way.
Detection of 'presence' and 'required cardinality' of labels and metadata. | Checks are available on OWL elements capturing lexical information, i.e. rdf:ID, rdfs:label, own annotation properties and standard annotation properties, e.g. from Dublin Core or SKOS.
Check for syntactical and typographical patterns and label length, i.e. to discover too short or too long names within string values of selectable entities. | Allows checking naming conventions via simple string matches and full regular expressions. Checks the length of labels. A significant fraction of the OBO Foundry naming conventions can be checked, i.e. case, separator but also morphemic conventions.
Detection and counts of redundant class labels. | Label repetition can be checked for via the ComparePanel.
Comparison of values between pairs of entities to detect similarities and avoid redundancies. | Operators like equals, contains or starts with can be used to compare selectable entities.
Quantification of ontology measures useful for ontology evaluation, progress monitoring and complexity analysis. | Displays the percentage or absolute number of entities having 'exactly', 'at least' or 'at most' a certain number of annotation properties, direct sub-/superclasses, or 'usages', i.e. indicating 'hub nodes'.
Testing for cardinalities and metadata completeness
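As an illustration only, the 'exactly', 'at least' and 'at most' cardinality tests named in the requirements can be sketched in plain Java. This is not OntoCheck's code and does not use the Protégé or OWL API: the class name CardinalityCheck, the nested-map model of annotations and the sample data are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: annotations are modelled as
// class name -> (annotation property name -> list of values).
class CardinalityCheck {
    enum Mode { EXACTLY, AT_LEAST, AT_MOST }

    /** True if the actual number of values meets the required cardinality. */
    static boolean satisfies(int actual, Mode mode, int required) {
        switch (mode) {
            case EXACTLY:  return actual == required;
            case AT_LEAST: return actual >= required;
            default:       return actual <= required; // AT_MOST
        }
    }

    /** Lists the classes whose value count for the property fails the test. */
    static List<String> violations(Map<String, Map<String, List<String>>> annotations,
                                   String property, Mode mode, int required) {
        List<String> failing = new ArrayList<String>();
        for (Map.Entry<String, Map<String, List<String>>> e : annotations.entrySet()) {
            List<String> values = e.getValue().get(property);
            int actual = (values == null) ? 0 : values.size();
            if (!satisfies(actual, mode, required)) {
                failing.add(e.getKey());
            }
        }
        return failing;
    }

    /** Share of failing classes in percent (0 if there are no classes). */
    static double percent(int failing, int total) {
        return total == 0 ? 0.0 : 100.0 * failing / total;
    }

    /** Made-up sample data, for illustration only. */
    static Map<String, Map<String, List<String>>> sample() {
        Map<String, Map<String, List<String>>> data =
                new LinkedHashMap<String, Map<String, List<String>>>();
        Map<String, List<String>> cell = new LinkedHashMap<String, List<String>>();
        cell.put("rdfs:label", Arrays.asList("cell"));
        cell.put("definition", Arrays.asList("An example definition."));
        data.put("Cell", cell);
        Map<String, List<String>> nucleus = new LinkedHashMap<String, List<String>>();
        nucleus.put("rdfs:label", Arrays.asList("nucleus", "cell nucleus"));
        data.put("Nucleus", nucleus);
        return data;
    }

    public static void main(String[] args) {
        // Classes lacking a definition, and classes with more or fewer than one label.
        System.out.println(violations(sample(), "definition", Mode.AT_LEAST, 1));
        System.out.println(violations(sample(), "rdfs:label", Mode.EXACTLY, 1));
    }
}
```

In this toy data set both checks report the class 'Nucleus', which has no definition and two labels.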
Testing for lexical patterns in names with regular expressions
The ability to match an entity against standard Java regular expressions, as listed for java.util.regex.Pattern, can be used to check names for the presence or absence of specific lexical prefix, infix or postfix patterns. E.g. a regular expression of the form .*ValueRegion|.*Region can be used to test for explicitness in labels, i.e. all 'ValueRegion' subclass names should contain either the explicit postfix 'ValueRegion' or 'Region'. This function also allows detecting 'metalevel' postfixes like '_class', '_type', '_concept', or '_relation'. Also stop-words like 'A' and 'the', as well as Boolean operators ('and', 'or'), and lexical indications for negations ('non', 'anti' or 'dis') can be detected and removed from names.
Checks for minimum and maximum character and word count can identify potentially unclear names, e.g. names shorter than 4 characters, or unreadable names longer than e.g. 50 characters or 10 words. Checks for punctuation, e.g. whether dots are present, allow for the detection of abbreviations, while all-upper-case checks can detect acronyms. Checks for cardinality indicators within names could be used in a semantic analysis guiding expressivity selection, e.g. words indicating cardinality requirements, such as 'minimal' or 'maximal', might hint that the selected OWL EL profile is not sufficient.
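A minimal sketch of such a regular expression test, using only java.util.regex.Pattern, might look as follows. The class name RegexNameCheck, its method names and the sample class names are hypothetical; how OntoCheck itself applies the pattern inside Protégé is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OntoCheck: applies a regular
// expression to a list of class names given as plain strings.
class RegexNameCheck {

    /** Returns the names that do NOT fully match the regular expression. */
    static List<String> violations(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> failing = new ArrayList<String>();
        for (String name : names) {
            if (!pattern.matcher(name).matches()) {
                failing.add(name);
            }
        }
        return failing;
    }

    /** Returns the names ending in one of the 'metalevel' postfixes. */
    static List<String> withMetalevelPostfix(List<String> names) {
        Pattern pattern = Pattern.compile(".*_(class|type|concept|relation)");
        List<String> found = new ArrayList<String>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up names, for illustration only.
        List<String> names = Arrays.asList(
                "ColorValueRegion", "SizeRegion", "Temperature", "Cell_type");
        System.out.println(violations(names, ".*ValueRegion|.*Region"));
        // prints [Temperature, Cell_type]
        System.out.println(withMetalevelPostfix(names));
        // prints [Cell_type]
    }
}
```

Note that matches() requires the whole name to match, so the pattern .*ValueRegion|.*Region behaves as a postfix test.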
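The length, punctuation and acronym checks described above can be approximated with a few string tests. The following is a sketch under stated assumptions, not the plugin's implementation: the class name NameLengthCheck is hypothetical, words are assumed to be separated by whitespace, underscores or hyphens (camel-cased words are not split), and an acronym is assumed to be two or more upper-case letters or digits.

```java
// Hypothetical helper, not part of OntoCheck.
class NameLengthCheck {

    /** Counts words; whitespace, '_' and '-' are assumed to separate words. */
    static int wordCount(String name) {
        int count = 0;
        for (String token : name.split("[\\s_-]+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** True if the name has fewer characters than the given minimum. */
    static boolean tooShort(String name, int minChars) {
        return name.length() < minChars;
    }

    /** True if the name exceeds the character limit or the word limit. */
    static boolean tooLong(String name, int maxChars, int maxWords) {
        return name.length() > maxChars || wordCount(name) > maxWords;
    }

    /** True if a dot is present, which may indicate an abbreviation. */
    static boolean containsDot(String name) {
        return name.indexOf('.') >= 0;
    }

    /** True for all-upper-case names such as 'DNA' (assumed acronym shape). */
    static boolean looksLikeAcronym(String name) {
        return name.matches("[A-Z0-9]{2,}");
    }

    public static void main(String[] args) {
        // Thresholds taken from the examples in the text: 4 and 50 characters, 10 words.
        String name = "DNA";
        System.out.println(tooShort(name, 4));         // prints true
        System.out.println(tooLong(name, 50, 10));     // prints false
        System.out.println(looksLikeAcronym(name));    // prints true
        System.out.println(containsDot("e.g. value")); // prints true
    }
}
```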
Testing for typographic naming conventions
The Check panel allows verifying whether a particular naming convention is fulfilled for a chosen entity, e.g. whether all values for the rdf:ID/OWLClassName in a selected subtree comply with an 'all-lower-case-underscore-separator' convention. We here list the typographic and syntactic checks possible:
Word Separator: An entity can be checked for none, space, hyphen, underscore and dot separator conventions.
Word Case: An entity can be checked for all lower case, ALL UPPER CASE, Upper case start, camel Hump and Camel Case conventions.
Digits: An entity can be checked for numbers in labels, e.g. to look for cardinality and order indicators.
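The separator, case and digit checks listed above can be sketched as simple predicates over a name string. This is only an illustration: the class name TypographicCheck and the exact definitions of each convention (for example, 'Upper case start' taken to mean an upper-case first character with no further upper-case characters) are assumptions of this sketch and may differ from the plugin's behaviour.

```java
import java.util.Locale;

// Hypothetical helper, not part of OntoCheck.
class TypographicCheck {
    enum Separator { NONE, SPACE, HYPHEN, UNDERSCORE, DOT }

    /** Maps a character to the separator it represents, or null if it is none. */
    private static Separator classify(char c) {
        switch (c) {
            case ' ': return Separator.SPACE;
            case '-': return Separator.HYPHEN;
            case '_': return Separator.UNDERSCORE;
            case '.': return Separator.DOT;
            default:  return null;
        }
    }

    /** True if every separator character in the name is the allowed one. */
    static boolean usesOnlySeparator(String name, Separator allowed) {
        for (char c : name.toCharArray()) {
            Separator found = classify(c);
            if (found != null && found != allowed) {
                return false;
            }
        }
        return true;
    }

    static boolean isAllLowerCase(String name) {
        return name.equals(name.toLowerCase(Locale.ROOT));
    }

    static boolean isAllUpperCase(String name) {
        return name.equals(name.toUpperCase(Locale.ROOT));
    }

    /** Assumed meaning: first character upper case, the rest without upper case. */
    static boolean hasUpperCaseStart(String name) {
        return !name.isEmpty()
                && Character.isUpperCase(name.charAt(0))
                && isAllLowerCase(name.substring(1));
    }

    /** Lower-case start followed by at least one capitalised hump, e.g. 'hasPart'. */
    static boolean isCamelHump(String name) {
        return name.matches("[a-z][a-z0-9]*([A-Z][a-z0-9]*)+");
    }

    /** Every word capitalised and no separators, e.g. 'ValueRegion'. */
    static boolean isCamelCase(String name) {
        return name.matches("([A-Z][a-z0-9]*)+");
    }

    static boolean containsDigit(String name) {
        return name.matches(".*\\d.*");
    }

    public static void main(String[] args) {
        // The 'all-lower-case-underscore-separator' convention from the text.
        String name = "cell_membrane";
        boolean ok = isAllLowerCase(name)
                && usesOnlySeparator(name, Separator.UNDERSCORE);
        System.out.println(ok); // prints true
    }
}
```

Combining two predicates, as in main, corresponds to checking one case convention together with one separator convention.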
Comparing values between specified entities
Quantifying ontology measures for ontology evaluation
Counting the 'usages' of a class, e.g. listing all classes with no 'usages' in restrictions other than subclassing named classes, allows detecting 'ontological isolates' that have no dependencies. As such orphans are ignored by other logical definitions, they could potentially be removed or hidden in a simplified view of an ontology that focuses on the ontology's defined and embedded classes linked via object properties. Analyzing the number of richly axiomatized classes (so-called 'hub nodes') helps to determine how much work was put into an ontology's computer-accessible semantics. Listing hub nodes that have many in- and outgoing relations also provides a proxy for the core domain described in the ontology, as these are likely to represent the more important classes in a formalized domain. As an application is likely to focus on these 'key classes', particular care must be taken to ensure that domain coverage is of sufficient granularity here.
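To make the notions of 'isolates' and 'hub nodes' concrete, the following sketch counts in- and outgoing links over a deliberately simplified model. It is not how OntoCheck or the OWL API computes usages: the class name UsageCount, the map from each class to the classes mentioned in its restrictions, the threshold parameter and the sample data are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: 'references' maps each class to the classes that
// are mentioned in its restrictions (plain subclassing is left out).
class UsageCount {

    /** Counts in- plus outgoing links per class, sorted by class name. */
    static Map<String, Integer> linkCounts(Map<String, List<String>> references) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String cls : references.keySet()) {
            counts.put(cls, 0);
        }
        for (Map.Entry<String, List<String>> e : references.entrySet()) {
            for (String target : e.getValue()) {
                increment(counts, e.getKey()); // outgoing link of the source
                increment(counts, target);     // incoming link of the target
            }
        }
        return counts;
    }

    private static void increment(Map<String, Integer> counts, String key) {
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    /** Classes without any link: candidates for 'ontological isolates'. */
    static List<String> isolates(Map<String, List<String>> references) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : linkCounts(references).entrySet()) {
            if (e.getValue() == 0) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Classes with at least minLinks links: candidates for 'hub nodes'. */
    static List<String> hubs(Map<String, List<String>> references, int minLinks) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : linkCounts(references).entrySet()) {
            if (e.getValue() >= minLinks) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Made-up sample data, for illustration only. */
    static Map<String, List<String>> sample() {
        Map<String, List<String>> refs = new LinkedHashMap<String, List<String>>();
        refs.put("Cell", Arrays.asList("Nucleus", "Membrane"));
        refs.put("Nucleus", Arrays.asList("Cell"));
        refs.put("Membrane", Collections.<String>emptyList());
        refs.put("Orphan", Collections.<String>emptyList());
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(isolates(sample())); // prints [Orphan]
        System.out.println(hubs(sample(), 3));  // prints [Cell]
    }
}
```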
The OntoCheck user interface
The OntoCheck plugin provides a new editing tab within Protégé and is organized into the three subpanels Check, Compare and Count (see Figures 1, 2, 3), which are largely self-explanatory and easy to use. Tooltips are displayed for most items upon 'mouse-over object' actions. Each tab shows the class hierarchy pane to let a user select an entry node and the annotations pane in order to make the amendments required by the test results. Each pane allows specifying the check pattern in the left half and provides the test results in the right half of the pane.
All specified check constraints are stored in a 'history list' and can also be stored in an autogenerated external XML file, as an editor is likely to do the same check on an ontology repeatedly, i.e. as a pre-release check. The stored check-specification file can also be exchanged and shared among a group of developers. All result classes can be sorted alphabetically or according to hierarchy position. Result lists can subsequently be enriched with the lacking metadata, either directly or after being exported as a txt file and distributed among curators for later or concurrent curation.
The main tab for curating naming issues is the Check panel, which is the first panel opened by default. The Compare panel allows comparing the values for specified entities and the Count panel allows measuring how often a class is used in formal definitions. Additional screenshots can be found on the OntoCheck website.
Testing the OntoCheck tool
Exemplary OntoCheck tests with quantification of detected violations (table; checks listed include 'Upper case start' and 'Max Char Count < 20')
The usefulness of ontology design principles in general, and naming conventions in particular, increases considerably when supported by ontology editing tools. This had been shown earlier, e.g. for the Kismeta Validator, which was developed under a related paradigm, but focused on XML schemata and DB labels.
Looking at the practical application scenarios with examples outlined in the results section, we see that the OntoCheck plugin meets most of the desired specifications. It helped in discovering and alleviating labeling errors, fostered metadata enrichment and allowed investigating an ontology's formal expressivity. Specifically, the plugin allows for word case and delimiter checks, regular expression matching (affix checks), cardinality and entity comparison checks.
Of the sixteen OBO Foundry naming conventions, six could be checked with our plugin (nearly 40%). The remaining conventions, which OntoCheck was not able to check, would rely on a thorough lexical analysis requiring a lexicon, which is not yet implemented in this version of the plugin. However, this could be amended by integrating the LiLA framework for 'linguistic analysis of entity labels in ontologies', which provides an interface to various natural language processing tools and resources for deeper terminological analysis.
Rendering labels in ontologies more consistent will pave the way for tools that use lexical information in class names for ontology integration, formalization and inconsistency detection, e.g. OBOL, which recommends logical definitions for new classes and cross-products by exploiting lexical information from labels. Discussions have started in the OBO domain, where OORT, the OBO Ontology Release Tool, is currently being developed, to include such label checks in the release tool. OntoCheck would make a useful addition to this tool, provided its functionality were delivered as a standalone Java library using solely the OWL API, rather than using the Protégé API.
Lexical ontology alignment tools such as the PROMPT tool suite will be served with more robust information, making automatic alignment and integration easier and more reliable. Recently, ontology alignment and transformation techniques have been designed that explicitly rely on naming structures over the ontology graph, and thus will particularly benefit from a prior clean-up.
As long as accepted recommendations for certain combinations of single naming conventions are not available, we can only enable checks on a per-convention basis, rather than allowing multiple checks simultaneously, e.g. as defined in overall naming convention sets such as the Foundry vs. Manchester vs. Stanford style convention sets. If naming conventions were accessible in a standardized repository, one could envision checks and enforcements of whole naming schemes to be drawn from such libraries. In this regard, we have joined forces with the ontology design pattern community to transform naming conventions into formal reusable Naming ODPs. We are also investigating the reimplementation of parts of OntoCheck as a web service in order to foster integration into Semantic Web portals like Watson. This would ease reuse for portal and library providers, as semantic metrics can be updated continuously and used for ontology comparison, evaluation and ranking, e.g. helping to select compatible artefacts with similar design principles that can be aligned or merged easily.
At the moment the user has to amend violating labels manually, but for many cases names violating tests could be corrected (semi-)automatically in an 'OntoCure-mode' in the future. For an extensive and updated list of desired and upcoming features, please visit the OntoCheck webpages.
Although in an early development stage, the OntoCheck plugin has already proved useful in carrying out pre-release checks for ontologies in different projects [9–14]. It has helped alert developers to labelling violations and contributed to keeping these ontologies free of naming errors. It also rendered the ontologies more complete by curing the lack of metadata. Carried out as a pre-release check, the OntoCheck tests contributed to quality assurance in the mentioned projects. Ultimately, we hope this Protégé extension will contribute to secondary data usage by rendering class names more robust and consistent, hence easing lexical post-processing of annotated data.
Availability and requirements
Project name: The OntoCheck Plugin
Project home page: http://www.imbi.uni-freiburg.de/ontology/OntoCheck/
Operating system: Platform independent
Programming language: Java
Other requirements: Java 1.5.1 or higher, Protégé 4.1 or higher
License: GNU GPL
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) open access publication fund, DFG grant JA 1904/2-1, SCHU 2515/1-1 GoodOD (Good Ontology Design). Vojtěch Svátek is supported by the CSF under P202/10/1825 (PatOMat).
This article has been published as part of Journal of Biomedical Semantics Volume 3 Supplement 2, 2012: Proceedings of Ontologies in Biomedicine and Life Sciences (OBML 2011). The full contents of the supplement are available online at http://www.jbiomedsem.com/supplements/3/S2.
1. The Protégé Ontology Editor and Knowledge Acquisition System. last accessed Feb. 9, 2012, [http://protege.stanford.edu/]
2. Tuason O, Chen L, Liu H, Blake JA, Friedman C: Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput. 2004, 238-249.
3. Schober D, Smith B, Lewis SE: Survey-based naming conventions for use in OBO Foundry ontology development. BMC Bioinformatics. 2009, 10: 125-10.1186/1471-2105-10-125.
4. Smith B, Ashburner M, Rosse C: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007, 25 (11): 1251-1255. 10.1038/nbt1346.
5. The NCBO Bioportal. last accessed Feb. 9, 2012, [http://bioportal.bioontology.org/]
6. Dublin Core Metadata Element Set, Version 1.1. last accessed Feb. 9, 2012, [http://dublincore.org/documents/dces/]
7. Ontology for Biomedical Investigations (OBI). http://obi.sourceforge.net/, here see http://sourceforge.net/tracker/?func=detail&aid=3258610&group_id=177891&atid=886178, last accessed Feb. 9, 2012
8. Main Page Disease Ontology WIKI. http://do-wiki.nubic.northwestern.edu/index.php/Main_Page, here see http://do-wiki.nubic.northwestern.edu/index.php/Style_Guide, last accessed Feb. 9, 2012
9. Beißwanger E, Schulz S, Stenzhorn H, and Hahn U: BioTop: An Upper Domain Ontology for the Life Sciences. Applied Ontology. 2008, 3 (4): 205-212.
10. Schober D, Boeker M, Bullenkamp J et al: The DebugIT core ontology: semantic integration of antibiotics resistance patterns. Stud Health Technol Inform. 2010, 160 (Pt 2): 1060-4.
11. NTDO - Neglected Tropical Disease Ontology. http://www.cin.ufpe.br/~ntdo/, last accessed 20.01.2012
12. Hepp M: GoodRelations: An Ontology for Describing Products and Services Offers on the Web. EKAW '08, Proceedings of the 16th international conference on Knowledge Engineering: Practice and Patterns. 2008, Springer, LNCS 5268, 329-346.
13. Vehicle Sales Ontology. http://www.heppnetz.de/ontologies/vso/ns, last accessed 20.01.2012
14. Boeker M, Stenzhorn H, Kumpf K, ijlenga P, Schulz S, and Hanser S: The @neurIST Ontology of Intracranial Aneurysms: Providing Terminological Services for an Integrated IT Infrastructure. AMIA Annual Symposium Proceedings. 2007, 56-60.
15. Pattern Class Java API documentation (Java 2 Platform SE v1.4.2). http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html, last accessed 20.01.2012
16. Schober D, Svátek V, Boeker M: Checking Class Labels against Naming Conventions: First experience with the OntoCheck Protégé plugin. Proceedings of the International Conference on Biomedical Ontology, ICBO 2012. 2012, accepted paper, Graz, Austria.
17. Kismeta Validator v1.1b, Enterprise Data Standards Validation and Enforcement. http://www.kismeta.com/Validtr.html, last accessed 20.01.2012
18. LiLA (Linguistic Label Analysis) framework for the linguistic analysis of phrases that can occur as class or property labels in ontologies. http://code.google.com/p/lila-project/, last accessed 20.01.2012
19. Mungall CM: Obol: Integrating Language and Meaning in Bio-Ontologies. Comparative and Functional Genomics. 2004, 5: 509-520. 10.1002/cfg.435.
20. Introduction to the Obo Ontology Release Tool. http://code.google.com/p/owltools/wiki/OortIntro, last accessed 20.01.2012
21. Noy NF, Musen MA: Anchor-PROMPT: Using Non-Local Context for Semantic Matching. Proceedings of the Workshop on Ontologies and Information Sharing, Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001), Seattle, WA. 2001, SMI technical report SMI-2001-0889.
22. Šváb-Zamazal O, Svátek V, Iannone L: Pattern-Based Ontology Transformation Service Exploiting OPPL and OWL-API. EKAW 2010, 17th International Conference on Knowledge Engineering and Knowledge Management, Lisbon, Portugal. 2010, Springer LNCS 6317, 105-119.
23. Ontology Design Patterns.org (ODP). http://ontologydesignpatterns.org/wiki/Main_Page, last accessed 20.01.2012
24. d'Aquin M, Gridinoc L, Angeletou S, Sabou M, Motta E: Characterizing Knowledge on the Semantic Web with Watson. EON'07 Workshop at ISWC'07. 2007, [http://watson.kmi.open.ac.uk/editor_plugins.html]
25. Rogers JE: Quality assurance of medical ontologies. Methods Inf Med. 2006, 45: 267-274.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.