Using OWL for a data-validation tool is a slight departure from OWL’s primary aims. Rather than using the ontology to validate the consistency of the knowledge base or to make further inferences from the axioms specified in the knowledge base, the task of a data-validation tool is primarily to verify the adherence of data elements to a set of prerequisite and well-defined rules. Data validation lends itself more to terminological-component (TBox) reasoning in a closed-world scenario.
Given its need to gather and relate information from distributed sources on the web, an important design decision of OWL is its inherent open-world assumption (OWA). In the OWA, a statement that has not been asserted and cannot be inferred is treated as unknown rather than false, with the consequence that not everything can be known a priori about a particular entity – there may always be new knowledge that extends the information about it. In contrast, the closed-world assumption (CWA) considers a statement to be false unless it has been explicitly stated as true. This topic has been addressed by others who have proposed a number of practical solutions for dealing with predominantly closed-world scenarios [16, 17], essentially by developing semantics of extended DL knowledge bases using the notion of integrity constraints. Implementation of these extensions was, however, still under investigation at the time: Tao et al. [16] were using SPARQL query answering, while the intention of Motik et al. [17] was to implement their approach in the KAON2 DL reasoner.
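The practical difference between the two assumptions can be illustrated with a minimal sketch. The facts, property names, and the three-valued `Truth` type below are hypothetical and serve only to show how the same unasserted statement is answered under each assumption; they are not taken from the ontology described in this paper.

```python
from enum import Enum


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical knowledge about a single case record (illustrative only).
ASSERTED_TRUE = {("case1", "hasTopography", "C717")}
ASSERTED_FALSE = {("case1", "hasBehaviour", "in situ")}


def closed_world(statement):
    """CWA: anything not stated as true is taken to be false."""
    return Truth.TRUE if statement in ASSERTED_TRUE else Truth.FALSE


def open_world(statement):
    """OWA: a statement that is neither asserted nor refuted stays unknown."""
    if statement in ASSERTED_TRUE:
        return Truth.TRUE
    if statement in ASSERTED_FALSE:
        return Truth.FALSE
    return Truth.UNKNOWN


# The morphology of case1 was never stated either way.
query = ("case1", "hasMorphology", "9380")
cwa_answer = closed_world(query)
owa_answer = open_world(query)
```

Under the CWA the query is answered as false, whereas under the OWA it remains unknown; this is why a missing field in a record cannot, by itself, be used by an OWL reasoner to flag a violation.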
The main advantage of using OWL lies in its flexibility and expressiveness for creating web ontologies, as well as its basis in the Resource Description Framework (RDF) for linking structures and making them directly accessible on the Web via unique Uniform Resource Identifiers (URIs). Other advantages include the wide availability of reasoning tools for OWL ontologies, important for making inferences from quite a complex set of inter-dependent rules, as well as its ability to import separate ontologies. Cancer registration draws on a number of coding standards and terminologies (including ICD-10 [18], ICD-O-3 [19], TNM Classification of Malignant Tumours [5], and SNOMED CT [20]), and where these exist as separate ontologies OWL is able to import them without having to redefine all the associated entities. The difficulties arising from the OWA did not present a major limitation to the use of OWL in this work, as may be seen in the example scenarios given in the Results section.
Two existing ontologies were considered as a potential basis for the data-validation tool. Both ontologies are highly pertinent to the field of cancer registration but are in preliminary form and undergoing further development. The first was developed as a model for integration of disease classifications in oncology (essentially integrating subsets of ICD-10 and ICD-O-3 terminologies) [21]. The second was developed for the analysis and visualisation of disease courses [13]. These two ontologies, together with the data-validation work described here, address three major concerns of population-based CRs. Population-based CRs record all incident cases of cancer in a well-defined population. They collect this data from multiple information sources – such as hospital-discharge and clinical records, pathology reports, and death certificates – and often have to deal with different systems of disease encoding. The aim of the ontology of Jouhet et al. [21] was to facilitate the task of disease identification independent of the coding system used. The subsequent step is to ensure the validity of data using standardised rules, most of which check inter-variable dependencies in the manner described in this paper. Once the data has been validated, it can then be used in data analyses of the type described by Esteban-Gil et al. [13]. These studies generally select cohorts of patients on the basis of specific criteria (e.g. disease courses or patient outcomes).
These three processes use the data in quite different ways and for quite different purposes. Whereas the goal should be to unite the concepts in a single CR ontology, further study is required to find an optimum solution that addresses each process without adding inconsistencies in the axioms for the other processes or unnecessary overheads to the automatic reasoning functions. For example, the ontology of Jouhet et al. [21] draws on the North American National Cancer Institute thesaurus (NCIt), included in the Open Biological and Biomedical Ontology (OBO) Foundry [22], and the authors note that the ontology suffers a number of flaws, particularly in the logic-based reasoning and should only be used cautiously. The ontology of Esteban-Gil et al. [13] operates on post-validated data and serves as a potential tool for research and knowledge management. It forms part of a larger more complex system for building the queries via SPARQL and imports classes from the Semanticscience Integrated Ontology (SIO) [23] and the Ontology for Biomedical Investigations (OBI) [24].
Although both ontologies contained a number of common classes (e.g. those deriving from the ICD-O-3 nomenclature), they were structured in a form that would have proved convoluted or restrictive for the data-validation rules. For example, the ICD-O-3 codes in the disease-classification ontology were modelled as individuals of type equating to ICD-10 classes; and in the disease-courses ontology, morphology codes and behaviour codes were integrated, whereas a number of the data-validation rules refer to separate behaviour codes. More importantly, both ontologies contained many more classes and logical axioms than were required by the data-validation ontology (over 20,000 compared to some 4000 classes; and over 50,000/150,000 as opposed to some 6000 logical axioms), and would have impacted unfavourably on automatic-reasoning performance. Thus, the decision was taken to develop a dedicated ontology for the purpose of this work, particularly with a view to fulfilling the following three main requirements:
i) To provide the means of encapsulating the ENCR data-validation rules in a formal and unambiguous manner;
ii) To facilitate the integrity and maintenance of the rules by ensuring a unique, and uniquely addressable, repository of the rules;
iii) To utilise the automatic reasoning logic for the data validation and supplement it within a standalone computer programme only where the OWA was unable to make the necessary inferences.
The ontology was developed in the OWL sublanguage OWL DL in order to retain decidability to allow complete reasoning, as well as to take advantage of some critical underlying tools - such as the reasoning tools to infer logical consequences from a given set of asserted facts or axioms, and the Protégé ontology editor/user interface [25].
The ontology utilises the sixth edition of the TNM Classification of Malignant Tumours standard [26]; the third edition of the International Classification of Diseases for Oncology (ICD-O-3) [19]; the International Rules for Multiple Primary Cancers [27]; and ENCR recommendations (such as coding of basis of diagnosis) [28]. None of these standards have been formalised in ontologies and for the purpose of this work, the ICD-O-3 codes and the TNM edition 6 codes were recreated as separate ontologies to import into the main ontology.
ENCR-JRC data-validation rules
The validation rules for the European CR core data set are described in [9]. For ease of interpretation, the rules are provided in a series of separate entity-relationship tables, which include:
i) unlikely and rare combinations of age and tumour type – an example of which is the combination of malignant extra-cranial and extra-gonadal germ cells (ICD-O-3 morphologies: 9060–9065, 9070–9072, 9080–9085, 9100–9105) with any of the ICD-O-3 topographies: C00-C55, C57-C61, C63-C69, C73-C750, C754-C768, C80; and age at diagnosis greater than 7;
ii) unlikely sex and topography combinations;
iii) valid combinations for basis-of-diagnosis and morphology and topography, such as basis of diagnosis specified as clinical investigation and ICD-O-3 morphology 9380 and ICD-O-3 topography C717;
iv) valid combinations for morphology and grade, such as ICD-O-3 morphologies: 9719, 9727, 9831, 9948 with grade 8 - NK cell (natural killer cell);
v) morphology codes and allowed topography codes, such as the combination of ICD-O-3 morphologies: 8160, 8161 with ICD-O-3 topographies C221, C239, and C240.
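Outside an ontology, a table of type (v) amounts to a simple lookup. The following minimal sketch uses only the example codes quoted above; the dictionary layout, the function name, and the choice to return `None` when no rule exists for a morphology are assumptions made for illustration and do not reproduce the full rule tables of [9].

```python
# Allowed topographies per morphology, limited to the example given in the text.
ALLOWED_TOPOGRAPHIES = {
    "8160": {"C221", "C239", "C240"},
    "8161": {"C221", "C239", "C240"},
}


def check_morphology_topography(morphology, topography):
    """Return True/False when a rule exists for the morphology, else None.

    Hypothetical helper: None means this sketch holds no rule for the code,
    not that the combination is valid.
    """
    allowed = ALLOWED_TOPOGRAPHIES.get(morphology)
    if allowed is None:
        return None
    return topography in allowed


ok = check_morphology_topography("8160", "C221")
bad = check_morphology_topography("8161", "C717")
no_rule = check_morphology_topography("9380", "C717")
```

Each table can be checked in isolation in this way; the difficulty discussed next arises once the tables share entities and the checks have to be kept mutually consistent.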
The validation rules also describe checks for permissible combinations of extent-of-disease, behaviour, and TNM, as well as specific checks for survival analysis and checks for inconsistencies of multiple primary malignant tumours, which if not identified can skew the statistics for incidence.
Whereas presentation of the rules in such a way makes it easier to understand the relationships between a subset of specific entities, transcribing them directly to a semantic data model introduces a degree of inter-coupling between many of the associated entities. Figure 1 illustrates the entities (boxes) comprising the rule tables of [9] and their rule dependencies.
The degree of inter-coupling can be discerned to some extent by the number of relationships between the entities and also by the number of direct and indirect dependencies on common entities. In Fig. 1 for example, “Basis of Diagnosis” has a validation-rule dependence on “Topography”. It also has a dependence on “Morphology” that itself has a dependence on “Topography”. “Basis of Diagnosis” has a further indirect dependence on Topography via “TNM” and “TNM Topography Grouping”. Likewise, “Stage Group” has a direct dependence on “TNM Topography Grouping” as well as a further dependence via “TNM”. The entities involved in these types of dependencies are displayed as shaded boxes. Such dependencies tend to complicate the task of modelling entity-relationships in software and generally result in higher maintenance overheads.
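The notion of direct and indirect dependencies can be made concrete by counting dependency paths in a small directed graph. The sketch below encodes only the dependencies named in the paragraph above (Fig. 1 contains further entities and edges that are not reproduced here), and the function name is hypothetical. It assumes the graph is acyclic.

```python
# Dependency edges as described in the text: "A": ["B"] means A depends on B.
DEPENDENCIES = {
    "Basis of Diagnosis": ["Topography", "Morphology", "TNM"],
    "Morphology": ["Topography"],
    "TNM": ["TNM Topography Grouping"],
    "TNM Topography Grouping": ["Topography"],
    "Stage Group": ["TNM Topography Grouping", "TNM"],
}


def count_paths(graph, source, target):
    """Count distinct dependency paths from source to target (acyclic graph)."""
    if source == target:
        return 1
    return sum(count_paths(graph, nxt, target) for nxt in graph.get(source, ()))


bod_to_topography = count_paths(DEPENDENCIES, "Basis of Diagnosis", "Topography")
stage_to_grouping = count_paths(DEPENDENCIES, "Stage Group", "TNM Topography Grouping")
```

An entity pair connected by more than one path is a candidate for refactoring, since a change to the target entity has to be propagated consistently along every path.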
Coupling can be reduced by refactoring some of the dependencies after adding a number of extra data entities. Figure 2 illustrates the situation after adding the extra data entities: “Topography Grouping” that groups topographies for different types of tumours; “Morph-Behaviour” that classifies the possible permutations of morphology and behaviour; and “Tumour Type” that classifies the possible tumour types on the basis of topography groupings and morphology-behaviour. These extra data entities were modelled on SEER’s histology/behaviour description categorisation [29], which is itself based on ICD-O-3.
The model however still suffers some drawbacks. In particular, the topography and morphology entities are not as decoupled as they could be – the reason being that the current rules for basis-of-diagnosis are described granularly in specific terms of morphology/morphology-topography, and age. Also, the TNM topography groupings do not currently map in all cases to the topography-grouping definitions that are referenced by the tumour-type definitions (cf. Fig. 2); for example, the TNM Topography Grouping (TNM 6th edition) entity for larynx includes the ICD-O-3 topography codes: C320, C321, C322, and C101 whereas the Topography Grouping entity includes codes: C320, C321, C322, C323, C328, and C329. Were these definitions to be redefined, then a cleaner model such as that shown in Fig. 3 could be realised.
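The larynx mismatch can be checked with plain set operations. The two code sets below are those quoted in the paragraph above; the variable names are illustrative only.

```python
# ICD-O-3 topography codes for larynx, as quoted in the text.
TNM_TOPOGRAPHY_GROUPING_LARYNX = {"C320", "C321", "C322", "C101"}
TOPOGRAPHY_GROUPING_LARYNX = {"C320", "C321", "C322", "C323", "C328", "C329"}

# Codes shared by both definitions and codes unique to each.
shared = TNM_TOPOGRAPHY_GROUPING_LARYNX & TOPOGRAPHY_GROUPING_LARYNX
only_in_tnm = TNM_TOPOGRAPHY_GROUPING_LARYNX - TOPOGRAPHY_GROUPING_LARYNX
only_in_grouping = TOPOGRAPHY_GROUPING_LARYNX - TNM_TOPOGRAPHY_GROUPING_LARYNX

print(sorted(shared))
print(sorted(only_in_tnm))
print(sorted(only_in_grouping))
```

Only three of the codes are common to both definitions, so neither grouping can simply be substituted for the other without redefining one of them.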
Transcribing the rules into the OWL ontology
The OWL ontology was developed on the entity-dependency model described in Fig. 2. The entities shown in the figure formed the main OWL classes and the rules were derived from the entity-relationship tables provided in [9].
The ICD-O-3 and the International Union Against Cancer (UICC) TNM tumour classifications do not currently exist in OWL format. Whereas others have developed OWL ontologies to address this need [13, 30], they are not comprehensive and were created with the specific aims of the respective studies in mind. A strength of OWL – which can nevertheless lead to a potential drawback – is that ontologies can be created in a number of ways, allowing an ontology to be tailored to a specific design need. An ontology tailored to one design constraint is not necessarily easily adaptable. In particular, the ontologies closest to the work presented here are described in [30]; however the ICD-O-3 ontology contained relatively few morphology classes and the TNM ontology (TNM edition 6) sub-classed the topographies under the stage groups. Our study required the full complement of the ICD-O-3 morphologies and for the TNM ontology it was preferable to sub-class the stage groups under the TNM topography groupings (since the former are generally dependent on the latter). It was therefore necessary to recreate separate ontologies for the OWL classes that mapped to ICD-O-3 and TNM. For the latter, edition 6 was used but the other editions (7 and 8) could be developed on the same basis.
The ability to import other ontologies from within a given ontology is nevertheless an important feature of OWL and will allow much faster development times as and when ontologies of standards suitable to cancer registration become available.
The classes and relationships were created using the Protégé tool. The expressivity of the description logic used in the ontology equated to SHI(D), which is less than OWL DL’s full expressivity of SHOIN(D), since the validation tests required the functionality neither of nominals (O) nor of cardinality restrictions (N).
OWL provides a number of ways whereby rules can be encoded in an ontology. The rules however need to distinguish between what the JRC-ENCR validation process considers errors and warnings. The following three scenarios are used for handling CR data-coding violations: (i) direct violation of a strict rule, resulting in a CR coding error, handled via an OWL disjoint statement; (ii) “soft” violation of a rule through an unlikely condition that prompts a warning, handled via an OWL restriction statement; and (iii) the conjunction of a number of conditions that together do not allow the code allocated by the cancer registry, handled via OWL’s class-subsumption mechanism.
Scenario (i) can be handled simply by specifying certain classes as disjoint from each other. For example, the requirement that in situ behaviour (code 2) must have basis-of-diagnosis given as either cytology (code 5) or histology of primary tumour (code 7) can be encoded by the statement using description-logic syntax:
This statement essentially states that any basis-of-diagnosis, other than code 5 or code 7, which is associated with a behaviour of type “in situ” (code 2) is a member of the empty set.
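The statement itself is not reproduced in this text. One axiom consistent with the paraphrase above might take the following form. Only Code2_InSituBehaviour is a name that appears elsewhere in this section; the class names BasisOfDiagnosis, Code5_CytologyBoD and Code7_HistologyPrimaryBoD and the property hasBehaviour are hypothetical placeholders, and the actual axiom in the ontology may be structured differently.

```latex
\[
\mathit{BasisOfDiagnosis}
\sqcap \neg\left(\mathit{Code5\_CytologyBoD} \sqcup \mathit{Code7\_HistologyPrimaryBoD}\right)
\sqcap \exists\,\mathit{hasBehaviour}.\mathit{Code2\_InSituBehaviour}
\;\sqsubseteq\; \bot
\]
```

Read from left to right: the intersection of "a basis-of-diagnosis", "not code 5 or code 7", and "associated with in situ behaviour" is subsumed by the empty class, so any record falling into that intersection makes the knowledge base inconsistent and can be reported as an error.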
Scenario (ii) can be handled via sub-classing. The class with a “soft” condition can be made a sub-class of a restriction. Thus, to model the fact that hepatoblastoma is unlikely to occur above the age of 5, its associated class can be made a sub-class of the restriction:
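The restriction is not reproduced in this text. A sketch of the form it could take is given here, assuming a hypothetical class Hepatoblastoma and a hypothetical data property hasAgeAtDiagnosis whose values are integers; neither name is confirmed by the source, and the use of a universal rather than an existential restriction is likewise an assumption.

```latex
\[
\mathit{Hepatoblastoma}
\;\sqsubseteq\;
\forall\,\mathit{hasAgeAtDiagnosis}.\,\mathit{integer}[\leq 5]
\]
```

A record classified as hepatoblastoma with an age at diagnosis above 5 then conflicts with the restriction, and the conflict can be reported as a warning rather than an error by the surrounding validation process.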
Scenario (iii) can be handled using defined classes. Defined classes are those that describe the necessary and sufficient conditions for any other class to be subsumed by them. It is the means by which a closed-world set of conditions can be expressly stated in OWL. For example, the definition for a basis-of-diagnosis described by the code “Clinical” can be specified by the defined class:
which means that if the parameters specified by an individual CR case record include a clinical basis-of-diagnosis (Code1_ClinicalDiagBoD), a morphology of 8000, 8720, 9140, 9590, or 9800, and a behaviour other than “in situ” (cf. the axiom above for basis-of-diagnosis codes and behaviour code Code2_InSituBehaviour) [28], then the OWL reasoner will be able to subsume the record under the class BoDCode1 and thereby validate the basis-of-diagnosis field of the record. Conversely, if the morphology field of the record is outside these prescribed values, or a behaviour of in situ (code 2) is specified, the reasoner will not be able to subsume the record under the BoDCode1 class and a mismatch in the basis-of-diagnosis code can be inferred.
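The defined class is not reproduced in this text. Under the description given, it might be sketched as follows. The names BoDCode1, Code1_ClinicalDiagBoD and Code2_InSituBehaviour are taken from the text; the properties hasBasisOfDiagnosis, hasMorphology and hasBehaviour and the morphology class names M8000 to M9800 are hypothetical placeholders.

```latex
\[
\begin{aligned}
\mathit{BoDCode1} \;\equiv\;
  & \exists\,\mathit{hasBasisOfDiagnosis}.\mathit{Code1\_ClinicalDiagBoD} \\
  & \sqcap\; \exists\,\mathit{hasMorphology}.\left(\mathit{M8000} \sqcup \mathit{M8720} \sqcup \mathit{M9140} \sqcup \mathit{M9590} \sqcup \mathit{M9800}\right) \\
  & \sqcap\; \exists\,\mathit{hasBehaviour}.\neg\,\mathit{Code2\_InSituBehaviour}
\end{aligned}
\]
```

The equivalence symbol is what makes the class "defined": the right-hand side is both necessary and sufficient for membership. In this sketch the behaviour condition is written as an existential restriction over the complement of the in situ class because, under the OWA, a reasoner can only conclude that a behaviour is not in situ when the behaviour classes have been declared mutually disjoint.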