Building a model for disease classification integration in oncology, an approach based on the national cancer institute thesaurus
© The Author(s) 2017
Received: 12 August 2016
Accepted: 11 January 2017
Published: 7 February 2017
Identifying incident cancer cases within a population remains essential for scientific research in oncology. Data produced within electronic health records can be useful for this purpose. Due to the multiplicity of providers, heterogeneous terminologies such as ICD-10 and ICD-O-3 are used for oncology diagnosis recording purpose. To enable disease identification based on these diagnoses, there is a need for integrating disease classifications in oncology. Our aim was to build a model integrating concepts involved in two disease classifications, namely ICD-10 (diagnosis) and ICD-O-3 (topography and morphology), despite their structural heterogeneity. Based on the NCIt, a “derivative” model for linking diagnosis and topography-morphology combinations was defined and built. ICD-O-3 and ICD-10 codes were then used to instantiate classes of the “derivative” model. Links between terminologies obtained through the model were then compared to mappings provided by the Surveillance, Epidemiology, and End Results (SEER) program.
The model integrated 42% of neoplasm ICD-10 codes (excluding metastasis), 98% of ICD-O-3 morphology codes (excluding metastasis) and 68% of ICD-O-3 topography codes. For every codes instantiating at least a class in the “derivative” model, comparison with SEER mappings reveals that all mappings were actually available in the model as a link between the corresponding codes.
We have proposed a method to automatically build a model for integrating ICD-10 and ICD-O-3 based on the NCIt. The resulting “derivative” model is a machine understandable resource that enables an integrated view of these heterogeneous terminologies. The NCIt structure and the available relationships can help to bridge disease classifications taking into account their structural and granular heterogeneities. However, (i) inconsistencies exist within the NCIt leading to misclassifications in the “derivative” model, (ii) the “derivative” model only integrates a part of ICD-10 and ICD-O-3. The NCIt is not sufficient for integration purpose and further work based on other termino-ontological resources is needed in order to enrich the model and avoid identified inconsistencies.
KeywordsNCI thesaurus Oncology Terminology Semantic integration ICD-10 ICD-O-3
With the increasing adoption of electronic health records (EHRs), the amount of data produced at the patient bedside is rapidly increasing. These data provide new perspectives to: create and disseminate new knowledge; consider the implementation of personalized medicine and offer to patients the opportunity to be involved in the management of their own medical data . Secondary use of biomedical data produced throughout patient care is an essential issue  and is the subject of numerous studies over several years [1–6]. Since 2007, the American Medical Informatics Association emphasized the value of secondary use of medical data: “Secondary use of health data can enhance healthcare experiences for individuals, expand knowledge about disease and appropriate treatments, strengthen understanding about the effectiveness and efficiency of our healthcare systems, support public health and security goals, and aid businesses in meeting the needs of their customers” .
In the oncology field, it is necessary to identify and describe incident cancer cases within a population in order to facilitate research and public health monitoring. For instance, cancer registries exhaustively record incident cases of cancer in a given territory (which correspond to all new cancer cases occurring over a geographical territory). This task remains time consuming if it is performed manually. As early as 1998, a technical report was drawn up by the International Agency for Research on Cancer describing the methods used by different registries for establishing automated procedures to identify new cases using available data . Methods have been proposed for automatically identifying and registering cancers using structured data indexed with standard terminologies [8–12].
However, many different medical specialties are contributing to record information in EHRs. As a result, within EHRs, data describing diseases are recorded according to multiple heterogeneous terminologies even for a single disease occurring in a single patient. For instance, in France, reimbursement data use the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10)  to describe diseases, whereas pathology data use either ADICAP (a French pathology terminology) or the 3rd edition of the International Classification of Diseases for Oncology (ICD-O-3)  and data from multidisciplinary meetings in oncology use ICD-O-3. Providing an integrated access to these disease classifications may improve automated cancer identification.
Although ICD-10 and ICD-O-3 both describe cancer diseases, they exhibit differences in terms of structure and granularity. Thus, it is necessary to identify or to build a resource that allows the integration of cancer disease classifications, taking into account these heterogeneities. To achieve this goal, relations must be defined between the involved concepts, such as “a neoplasm is a disease and has a specified morphology as well as a specified topography”. The National Cancer Institute thesaurus (NCIt)“provides reference terminology covering vocabulary for clinical care, translational and basic research, and public information activities”(cited from http://ncit.nci.nih.gov/, visited 2015-01-22). It is described as “a controlled terminology which exhibits ontology-like properties in its construction and use” . These characteristics “open up the possibility [...] in linking together heterogeneous resources created by institutions external to the NCI” . Thus, the NCIt could be used as a resource to bridge the gap between disease classifications, which are structurally heterogeneous.
However, since 2005, it has been shown on many occasions that the NCIt remains flawed [16–18] and especially that logic-based reasoning over the NCIt should be used cautiously. On the other hand, re-building a model “from scratch” would be time consuming and comes with no guarantee of avoiding inconsistencies. Despite the limitations described above, the NCIt contains knowledge that could be useful for our integration purpose. In this manuscript, we propose an approach to build a resource based on a subset of the NCIt, linking the three axes that refer to diseases as described in ICD-10 and ICD-O-3, i.e., the diagnosis as well as its morphology and its topography.
The behavior (Malignant) which is part of the morphology description.
The site of origin (upper-inner quadrant of breast) which corresponds to the topography.
ICD-O-3 is a multi-axial classification used in cancer registries in order to record the anatomic site (topography) and the morphology of a neoplasm. The morphology is coded with five digits. The first four digits represent the histological description and the fifth digit indicates the behavior (i.e. whether benign or malignant) of a neoplasm. As a result, it is not possible for a morphology to have multiple behaviors. “The topography code indicates the site of origin of a neoplasm; in other words, where the tumor arose” . From the ICD-O-3 “point of view”, any morphology code can be associated with any topography code. Some tumor morphologies have a “usual primary site” but it is expressly stated that these associations are provided only to help coders and should not be considered as systematic (and unique) topography-morphology combinations. An example is given in : “An unusual, but possible, example would be the diagnosis ’osteo-sarcoma of kidney’, for which the kidney topography code (C64.9) would be used instead of ’bone, NOS’ (C41.9) […]”. Thus, ICD-O-3 describes a disease by combining the morphology of the tumor and the topography from where the tumor arises. As a result, each neoplastic disease is not described as a whole concept entailed by a unique code within ICD-O-3.
Concepts involved in ICD-10 and/or ICD-O-3
Even if they are called “disease classifications”, ICD-10 and ICD-O-3 are in fact used within EHR for recording diagnoses. The diagnosis is a way for the physician to describe the disease, which corresponds to an evolving process but, in fact, it is not the disease itself. A single disease may have multiple diagnoses all along its clinical course (for instance, an in situ neoplasm may evolve and become a malignant invasive neoplasm) but the disease (process) remains the same. Thus, when used in this context, disease classifications are, in fact, kinds of diagnoses which can be viewed as opinions about the undergoing disease. This assertion is in accordance with the definition proposed by Scheuermann et al. in  who claim that a diagnosis is a “conclusion of an interpretive process that has as input a clinical picture of a given patient and as output an assertion to the effect that the patient has a disease of such and such a type. A diagnosis is a continuant entity that, once made, will survive through time, and is often supplanted by further diagnoses. The diagnostic process is thus iterative: the clinician is forming hypotheses during history taking, testing these during physical exam, forming new hypotheses as a result, and so on.”
The morphology of the tumor, which is a representation of the pathological description of the tumor reported at a given time. Morphology is represented within the ICD-O-3 morphology axis.
The topography of the tumor, which is a representation of the anatomical site of origin of the tumor reported at a given time. Topography is represented within the ICD-O-3 topography axis.
The diagnosis, which is a representation of the reported description of the tumor and encompasses information about both the topography and the morphology of the tumor. Diagnosis is represented as such within ICD-10 and can be built by combining an ICD-O-3 topography and an ICD-O-3 morphology.
Because it is not possible to state that a diagnosis is equivalent to either a topography or a morphology, it is obviously not possible to find equivalences between concepts represented within these two terminologies. The unique correspondences that can be found between ICD-10 and ICD-O-3 concepts are thus a diagnosis (i.e., an ICD-10 code) mapped to a topography-morphology combination (i.e., a pair of an ICD-O-3 topography code and an ICD-O-3 morphology code).
The national cancer institute thesaurus (NCIt)
“NCI Thesaurus (NCIt) is NCI’s reference terminology. NCIt provides the concepts used in caCORE and caBIG to establish data semantics. It covers terminology for clinical care, translational and basic research, and public information and administrative activities. NCIt is also a widely recognized standard for biomedical coding and reference, used by a broad variety of public and private partners both nationally and internationally” .
Some NCIt concepts are annotated as being mapped to some ICD-O-3 morphologies. For example, Invasive ductal carcinoma, not otherwise specified is annotated as being mapped to two ICD-O-3 morphology codes (8500/3 Infiltrating duct carcinoma, NOS and 8521/3 Infiltrating ductular carcinoma). The semantics of this mapping annotation are not defined (i.e., exact match or another type of relationship). In the NCIt, even if the term disease is employed, it is not clear whether Neoplasm represents the disease or the diagnosis. For instance, in the NCI term Browser (https://ncit.nci.nih.gov/ncitbrowser/pages/home.jsf?version=16.10e), Neoplasm is defined as “A benign or malignant tissue growth…” and “An abnormal mass of tissue…”. Disease classifications are mainly used in EHR for diagnoses recording. In the remaining part of this manuscript, we use the NCIt concept Neoplasm as a kind of diagnosis describing the disease.
An OWL-DL representation of the NCIt is freely available in the Web ontology Language (OWL) format on the NCI website (https://cbiit.nci.nih.gov/evs-download/thesaurus-downloads). Although logic-based reasoning can be made with this OWL-DL representation, some inconsistencies have been identified and it has been shown that the NCIt should be used cautiously for this purpose [16–18].
NCI Metathesaurus 
The NCI Metathesaurus (NCIm) is a biomedical terminology database “that covers most terminologies used by NCI for clinical care, translational and basic research, and public information and administrative activities” , including ICD-10 and ICD-O-3. It has been built and is maintained by the NCI. Its structure and a significant part of its concepts is based on the UMLS Metathesaurus . Inside the NCIm, elements coming from different terminologies but representing the same biomedical notion are grouped into the same Concept Unique Identifier (CUI).
Defining a formal pattern for linking diagnosis, topography and morphology
Building a model based on the NCIt corresponding to the formal pattern
Instantiating the model with terminologies
Defining a formal pattern for linking diagnosis, topography and morphology
has_morphology: for modeling the relation between a diagnosis and the type of cells (morphology) that are stated to be involved in the tumor described by the diagnosis.
has_primary_site: for modeling the relation between a diagnosis and an anatomical site (topography) that is stated to be the origin of the tumor described by the diagnosis.
Malignant neoplasm of lower outer quadrant of breast ≡
\(\cap \exists \) has_morphology.Malignant neoplasm
\(\cap \exists \) has_primary_site.Lower outer quadrant of breast Reflexive part
\(\cap \exists \) has_morphology.Adenocarcinoma
\(\cap \exists \) has_primary_site.Lower outer quadrant of breast Reflexive part) \(\subseteq \)
Malignant neoplasm of lower outer quadrant of breast
Building a model based on the NCIt corresponding to the formal pattern
Building a part-whole lattice
Lower outer quadrant of breast Reflexive part ≡
Lower outer quadrant of breast
\( \cup \exists \) part_of.Lower outer quadrant of breast
The lattice was built with DL-reasoning over classes defined using two part-whole relationships available in the NCIt (namely anatomic_structure_is_physical_part_of and anatomic_structure_has_location).
[NCIt diagnosis concept (refined)] ≡
\( \cap \exists \) disease_has_finding.[Morphology mapped to the NCIt concept]
Adenocarcinoma (refined) ≡
\( \cap \exists \) disease_has_finding.Adenocarinoma,NOS (ICD-O-3 morhology)
Each of the morphologies were classified depending on their tumoral behavior as described in ICD-O-3 (i.e., benign, malignant primary, in situ, malignant metastatic, unknown whether benign or malignant, unknown whether primary or metastatic).
Building the model
\(\cap \exists \) disease_has_primary_anatomic_site.Topography_Reflexive_Part
Based on the defined pattern, we implemented and executed the following algorithm:
Reflexive part topographies,
Diagnoses identified by the aforementioned algorithm.
Instantiating the model with disease classifications
Identify mappings between ICD-O-3 topographies and NCIt concepts having the same CUI within the NCIm,
Define these codes as instances of the corresponding NCIt concept,
Retrieve the corresponding reflexive part topography after DL-reasoning.
Identify mappings between ICD-10 codes and NCIt concepts having the same CUI within the NCIm,
- 2.Define concepts corresponding to ICD-10 codes based on the NCIt definition (by adding a restriction for the primary site based on the NCIt concept formal definition) and ICD-10 (by adding a restriction for the behavior) semantics. For instance :
Breast, Unspecified (C50.9) is a malignant primary neoplasm within the ICD-10 classification.
Breast, Unspecified (C50.9) has the same CUI as Malignant Breast Neoplasm (C9335) within the NCIm.
Malignant Breast Neoplasm (C9335) has as associated primary site Breast (C12971) within the NCIt.
The built expression describing Breast, Unspecified (C50.9) was then:
Malignant Breast Neoplasm
\(\cap \exists \) disease_has_finding.Malignant primary neoplasm
\(\cap \exists \) disease_has_primary_anatomic_site.Breast
Retrieve the corresponding diagnosis after DL-reasoning.
Evaluation of the model
For each combination, we queried the proposed model in order to evaluate how many branches of the diagnosis lattice were instantiated by both the ICD-10 code and the topography-morphology combination.
We tried to build mappings based on the proposed model with a simple algorithm (ICD-10 codes and topography-morphology combinations with the minimum hierarchical edge-based distance were considered as mapped) and compared it with the gold standard.
All the analyses were processed over the OWL-DL version of the NCIt (14.11d) available at http://evs.nci.nih.gov/ftp1/NCI_Thesaurus/.
Built model based on the NCIt
A total of 6720 topographies involved in at least one diagnosis definition was identified and the corresponding topographies reflexive parts were introduced in the NCIt in order to build the topography reflexive parts lattice (Section: “Building a part-whole lattice”). A total of 1120 NCIt codes was identified as being related to 1094 ICD-O-3 morphology codes. The 1094 corresponding morphology classes were added to the model and automatically classified under six general morphology classes depending on their behavior leading to a set of 1100 possible morphologies. Combining the 1100 morphology classes with the 6720 reflexive part topographies, 7392000 expressions were built. A total of 20133 (0.27%) expressions subsuming at least one NCIt code were identified and the corresponding classes were introduced in the model as diagnoses.
Instantiating the model with disease classifications
409 ICD-O-3 topographiesTable 1
Part of ICD-O-3 and ICD-10 terminologies integrated within the final model
Instantiating the final model
ICD-10 In situ
873 ICD-O-3 morphologies (excluding /6 Malignant neoplasms, stated or presumed to be secondary and /1 Neoplasms of uncertain and unknown behavior)
727 ICD-10 neoplasms (excluding C81-C96 Malignant neoplasms, stated or presumed to be secondary and D37-D48 Neoplasms of uncertain or unknown behavior)
Using the NCIm, 298 ICD-O-3 topography codes were linked to 540 NCIt codes. Within these NCIt codes, 29 were not subclasses of Anatomic Structure, System, or Substance. Among the 298 ICD-O-3 topography codes, 20 were related only to these 29 codes and were then excluded (e.g., C05.1 Soft palate, NOS was erroneously mapped to Malignant Soft Palate Neoplasm). Thus, 278 topography codes were finally included within the model as instances of the corresponding NCIt codes and classified as instances of topography reflexive parts after DL-reasoning. Using the NCIm, 302 ICD-10 codes were linked to NCIt codes. Building the corresponding expressions and after DL-reasoning, we were able to add 302 ICD-10 codes as instances of 380 diagnoses.
Characteristics of the final model
The resulting model is constituted of 113643 axioms, including 27953 classes (6720 topographies, 1100 morphologies and 20133 diagnoses). A total of 1440 codes were instantiated (278 ICD-O-3 topographies, 860 ICD-O-3 morphologies and 302 ICD-10 codes).
Number of codes instantiating multiple classes in the model
Instances of multiple classes
0 (- %)
ICD-10 In situ
∃ disease_has_finding.Cavernous hemangioma
∩∃ disease_has_primary_anatomic_site.Colorectal Region Reflexive part
∃ disease_has_finding.Cavernous hemangioma
∩∃ disease_has_primary_anatomic_site.Colon Reflexive part
∩∃ disease_has_primary_anatomic_site.Colorectal Region Reflexive part
∩∃ disease_has_primary_anatomic_site.Colon Reflexive part
The explanation for this situation is twofold: (1) there is neither an anatomic_structure_has_location relationship, nor an anatomic_structure_is_physical_part_of relationship between Colon and Colorectal Region within the NCIt; (2) Colon Cavernous Hemangioma is described as having these two anatomic structures as a primary site. On the other hand, through the NCIt diagnosis lattice, Colon Cavernous Hemangioma is described as being a subclass of the concepts Cavernous hemangioma and Hemangioma, NOS.
Comparison with the SEER conversion file
Comparison with the SEER conversion program according to the tumor type (hematopoietic and solid tumors) and the number of branches of the diagnosis lattice that are identified for an ICD-10 code / ICD-O-3 combination
Related in the model*
More than 1 branch a
Mappings rebuilt from the model**
Non unique mappings b
Implemented methods to build the model
We achieved to automatically build a model based on the NCIt, describing topographies, morphologies and diagnoses that can be instantiated by both ICD-O-3 and ICD-10 codes. As no morphological axis is available within the NCIt, we have adapted the NCIt by adding concepts corresponding to ICD-O-3 morphologies, which we related to the corresponding diagnoses (based on the ICD-O-3 annotation of the NCIt).
For the description of topographies, we have built an organ reflexive part lattice that enables the description of a primary site as encompassing the site itself and all its parts. These reflexive parts have been proposed for the biomedical domain in .
The diagnoses lattice was then automatically generated by DL-reasonners based on the topography and morphology lattices avoiding is_a overloading . As a result, the obtained diagnoses subsumption lattice is a valid formal representation of diagnoses hierarchy in respect to the definitions of classes contributing to their description
The NCIt completeness for describing diagnoses. As our method relies on NCIt diagnoses, the resulting classes which were built depend on their existence within the NCIt (e.g., C00.1 External Lower Lip malignant neoplasm is not available within the NCIt).
The NCIm provides a way to identify common concepts using the CUI but it can be incomplete or wrong (e.g., 20 ICD-O-3 topographies were mapped erroneously to non-anatomic concepts).
However, when the codes were found, we were able to identify a common diagnosis for all cases described in the SEER conversion program and, using a simple algorithm, 42% of the SEER mappings (corresponding to codes instantiating the model) could be rebuilt from the model. Our aim is not to enable conversion between codes but to provide a machine usable and semantically integrated view over them. From this perspective, the model is consistent because it provides links between ICD-10 diagnoses and ICD-O-3 topography-morphology combinations when they exist within the SEER conversion file. Moreover, the model describes many more possible relationships between diagnoses than the SEER conversion program does. For instance, in the latter, there is no relationship between Adenocarcinoma, NOS – Colon, NOS (C18.9 – M8140/3) and Malignant neoplasm of rectosigmoid junction (C19.9) whereas our model identifies successfully that they are both instances of Malignant, primary site - Large Intestine Reflexive part.
Choice of the NCIt
In the biomedical field, other description logics-based terminologies exist. Specifically, SNOMED-CT®; provides not only topography, morphology and diagnosis dimensions but also implements relationships between these concepts. However, the NCIt is specific to the oncology field and provides useful knowledge related to neoplasm diagnoses. The NCIt is freely and easily accessible.In contrast, SNOMED-CT has a much more restrictive affiliate license agreement and it is not easily accessible for countries which are not members of the International Health Terminology Standards Development Organisation (IHTSDO). In addition, it has been shown that SNOMED-CT’s formal representation suffers from the same flaws [30, 31] as the NCIt and has to be used cautiously while needing logic-based reasoning. Thus, a similar evaluation could be carried out on SNOMED-CT in order to estimate whether it could be useful for integrating disease classifications in oncology and to compare the results with what was found when using the NCIt.
Limitations of the NCIt for integration purposes
In , Schultz et al. discussed that the OWL-DL version of the NCIt may lead to unexpected results which were not visible due to the lack of use cases needing logic-based reasoning over the OWL-DL version of the NCIt. The integration of heterogeneous disease classifications corresponds to such a use case. We have identified some limitations due to inconsistencies.
On the one hand, the NCIt provides concepts describing cancer diagnoses and, on the other hand, concepts describing the tumor topography. It also provides relationships which are involved in topography-morphology combinations, themselves expected to be equivalences of diagnoses. Its formal representation and the availability of an OWL version enable reasoning and the implementation of DL-queries. However, some intrinsic characteristics prevent its direct use for the integration of cancer disease classifications: (i) the absence of distinction between morphologies and diagnoses; (ii) diagnosis concepts described as having a specific primary site but not its parts. We have proposed a method to address these issues and to automatically build a consistent model based on the NCIt and the intrinsic semantics available within ICD-O-3 and ICD-10.
Malignant, primary site – Cecum Reflexive part
Malignant, primary site – Colon Reflexive part
Malignant, primary site – Colorectal Region Reflexive part
Intraepithelial carcinoma, NOS - Breast Reflexive part
Epithelioma, NOS - Breast Reflexive part
Because Intraepithelial carcinoma, NOS has an in situ behavior and Epithelioma, NOS has a malignant, invasive behavior, it is not consistent to be an instance of these two diagnoses. This issue emphasizes erroneous mappings that may exist between ICD-O-3 and the NCIt due to ambiguous labels. A simple solution to this problem would be to add a concept representing the “Carcinoma” category of which both Carcinoma a Carcinoma In situ should be subclasses.
It is noteworthy that these patterns, which are mainly due to is_a overloading, can easily be retrieved by searching for codes which are instances of multiple diagnoses. By linking ICD-O-3 and ICD-10 terminologies to the NCIt and adding some restrictions based on their own semantics, our method may provide a useful auditing solution. Identifying those codes within the resulting model may enable discovery within the NCIt of: (i) structural inconsistencies (e.g., Malignant cecum neoplasm related to Colon), (ii) missing concepts (e.g., Carcinoma invasive that can be related to the ICD-O-3 concept Carcinoma, NOS) and (iii) missing relationships between concepts (e.g., Cecum which should be defined as a part of Colorectal Region).
In order to build the organ reflexive part lattice, parts of anatomical concepts were identified using transitive part-whole properties available within the NCIt (namely anatomic_structure_is_physical_part_of and anatomic_structure_has_location). This results in including cell parts as (indirect) subclasses of topographies (e.g. Birbeck Granule part_of Langerhans Cell part_of Epidermis part_of Skin). This would suggest that we allow a neoplasm to have Birkbeck Granule as primary site. Since the range of the disease_has_primary_anatomic_site property includes cells parts, such an assertion is allowed in the NCIt. Thereby, the built hierarchy is in accordance with the NICt representation of primary sites. As discussed in , the transitivity of the part_of property remains controversial. For instance, in , Rescher stated that “A part (i.e., a biological sub-unit) of a cell is not said to be a part of the organ of which that cell is a part”, which is in contradiction with that stated within the NCIt. However, diagnoses retained in the “derivative” model where those subsuming at least an NCIt concept so that diagnosis definitions remain realistic (because NCIt describes only existing, even if sometimes rare, tumors). Nevertheless, further work should be done in order to address this issue. In this context, patterns proposed by Schulz and Hahn in  are to be investigated.
The NCIt is known to contain some inconsistencies. Thus, the OWL-DL version of the NCIt should be used cautiously. However, this resource is helpful in order to build a formal model for integrating heterogeneous cancer disease classifications. Indeed, the NCIt remains a rich knowledge resource and, as shown in this work, it is possible to extract parts of this resource and reorganize them so as to correct some of these inconsistencies (such as is_a overloading). Even if classes introduced in the “derivative” model are consistent (they have been classified based on their formal definition), instantiating them with terminology codes can lead to misclassifications. Our approach for classes instantiation is based on the classification of NCIt concepts within the “derivative” model, which in turn depends on the formal definition of NCIt concepts. We have proposed an approach, which identified possible misclassifications thanks to multiple classes instantiation by a single code. While this approach enables to find inconsistencies, there is a need for methods capable of selecting the class that should be instantiated ultimately in this situation.
Using the NCIm CUI to map the NCIt to ICD-O-3 and ICD-10 can be useful but is not enough because mappings are missing and some are inconsistent. We are currently working on a method based on the NCIm to identify additional mappings.
SNOMED-CT®; exposes comparable structural characteristics with diagnosis, anatomic and even morphological concepts as well as relationships between them. Future work will explore SNOMED-CT as a resource for integration purpose. As SNOMED-CT is known to have the same inconsistencies as NCIt, we will study the feasibility of using both SNOMED-CT and the NCIt to build a consistent model addressing semantic and structural heterogeneities between disease classifications in oncology.
Topographies representation needs to be refined in order to avoid inconsistencies and define consistent levels of granularity for the propagation of the disease_has_primary_anatomic_site property. The Foundational Model of Anatomy (FMA) Ontology  “is a domain ontology that represents a coherent body of explicit declarative knowledge about human anatomy” . Further work will explore the ability to define these topographies based on the FMA.
The main goal of this work is to provide a consistent resource for the integration of heterogeneous disease classifications in oncology. While our approach based on the NCIt seems promising, two main limitations have been discussed (misclassification of codes within the “derivative” model and incomplete coverage due to NCIm mapping methods). The proposed pattern for the integration of disease classifications in oncology enables to extract knowledge from available resources. We will apply the same approach to other resources (e.g., SNOMED-CT and FMA) so as to enrich the “derivative” model. This future work will allow a better coverage and we will take advantage of existing links between concepts within these resources (i.e., anatomical descriptions from the FMA) and between resources (i.e., mappings available within the NCIm).
In addition, based on this “derivative” model, algorithms for disease identification can be built. This resource can manage heterogeneity by providing an integrated view of diagnoses recorded in EHRs. As a result, algorithms based on classes of the “derivative” model can use transparently available data coded with disease classifications and focus on building consistent rules for disease identification.
We have proposed a method to automatically build a model for integrating ICD-10 and ICD-O-3 based on the NCIt. The resulting “derivative” model is a consistent machine understandable resource that enables an integrated view of these heterogeneous terminologies. The NCIt structure and the available relationships can help to bridge disease classifications taking into account their structural and granular heterogeneity. However, (i) inconsistencies exist within the NCIt leading to misclassifications when instantiating the “derivative” model with terminologies, (ii) the “derivative” model only integrates a part of ICD-10 and ICD-O-3. The NCit is not sufficient for integration purpose and further work based on other termino-ontological resources is needed in order to enrich the model and avoid identified inconsistencies.
Availability of data and materials
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
VJ developed and implemented methods, conducted the analysis, interpreted results and wrote the manuscript. FM developed methods, interpreted results and contributed to the manuscript. BB contributed to methods, interpreted results and contributed to the manuscript. FT contributed to methods, interpreted results and contributed to the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Murdoch TB, Detsky AS. The inevitable application of big data to health care; 309(13):1351–52. doi:10.1001/jama.2013.393.
- Prokosch HU, Ganslandt T. Perspectives for medical informatics. reusing the electronic medical record for clinical research. 48(1):38–44.
- Safran C, Bloomrosen M, Hammond WE, Labkoff S, Markel-Fox S, Tang PC, Detmer DE, Panel E. Toward a national framework for the secondary use of health data: an american medical informatics association white paper; 14(1):1–9. doi:10.1197/jamia.M2273.
- Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: Data quality issues and informatics opportunities. 2010:1–5. Accessed 2013-01-10.
- Barton C, Kallem C, Van Dyke P, Mon D, Richesson R. Demonstrating “collect once, use many” - assimilating public health secondary data use requirements into an existing domain analysis model. 2011:98–107. Accessed 2013-01-10.
- El Fadly A, Rance B, Lucas N, Mead C, Chatellier G, Lastic PY, Jaulent MC, Daniel C. Integrating clinical research with the healthcare enterprise: from the RE-USE project to the EHR4cr platform. 44 Suppl 1:94–102. doi:10.1016/j.jbi.2011.07.007.
- Black RJ, Simonato L, Storm HH, Démaret E. International Agency for Research on Cancer: Automated Data Collection in Cancer Registration. IARC.
- Jouhet V, Defossez G, Ingrand P. Automated selection of relevant information for notification of incident cancer cases within a multisource cancer registry. 52(4). doi:10.3414/ME12-01-0101.
- Tognazzo S, Emanuela B, Rita FA, Stefano G, Daniele M, Fiorella SC, Paola Z. Probabilistic classifiers and automated cancer registration: An exploratory application. 42(1):1–10. doi:10.1016/j.jbi.2008.06.002. Accessed 2013-01-29
- Tognazzo S, Andolfo A, Bovo E, Fiore AR, Greco A, Guzzinati S, Monetti D, Stocco CF, Zambon P. Quality control of automatically defined cancer cases by the automated registration system of the venetian tumour registry quality control of cancer cases automatically registered. 15(6):657–664. doi:10.1093/eurpub/cki035. Accessed 2013-01-29
- Contiero P, Tittarelli A, Maghini A, Fabiano S, Frassoldi E, Costa E, Gada D, Codazzi T, Crosignani P, Tessandori R, Tagliabue G. Comparison with manual registration reveals satisfactory completeness and efficiency of a computerized cancer registration system. 41(1):24–32. doi:10.1016/j.jbi.2007.03.003. Accessed 2013-02-04
- Tagliabue G, Maghini A, Fabiano S, Tittarelli A, Frassoldi E, Costa E, Nobile S, Codazzi T, Crosignani P, Tessandori R, Contiero P. Consistency and accuracy of diagnostic cancer codes generated by automated registration: comparison with manual registration. 4:10. doi:10.1186/1478-7954-4-10. Accessed 2013-01-29
- Organization WH. International statistical classification of diseases and related health problems vol. 1. World Health Organization.
- Fritz AG. World Health Organization: International Classification of Diseases for Oncology: ICD-O.
- Hartel F, Warzel DB, Covitz P. OWL/RDF/LSID utilization in NCI cancer research infrastructure. In: W3C Workshop on Semantic Web for Life Sciences.
- Ceusters W, Smith B, Goldberg L. A terminological and ontological analysis of the NCI thesaurus. 44(4):498. Accessed 2014-03-20.
- Kumar A, Smith B. Oncology ontology in the NCI thesaurus. In: Artificial Intelligence in Medicine. Springer. p. 213–220. http://link.springer.com/chapter/10.1007/11527770_30. Accessed 2015-07-01.
- Schulz S, Schober D, Tudose I, Stenzhorn H. The pitfalls of thesaurus ontologization - the case of the NCI thesaurus. 2010:727–731. Accessed 2015-07-01.
- Scheuermann RH, Ceusters W, Smith B. Toward an Ontological Treatment of Disease and Diagnosis. Summit Transl Bioinforma. 2009; 2009:116–20. Accessed 2016-09-07TZ.Google Scholar
- NCI Thesaurus (NCIt) - EVS - National Cancer Institute - Confluence Wiki. https://wiki.nci.nih.gov/display/EVS/NCI+Thesaurus+(NCIt). Accessed 2015-03-03.
- NCI Metathesaurus. http://ncimeta.nci.nih.gov/ncimbrowser/. Accessed 2015-03-11.
- Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. 32:267–70. doi:10.1093/nar/gkh061.
- Tao C, Pathak J, Solbrig HR, Wei WQ, Chute CG. Terminology representation guidelines for biomedical ontologies in the semantic web notations. 46(1):128–138. doi:10.1016/j.jbi.2012.09.003. Accessed 2016-04-05
- Using OWL and SKOS. https://www.w3.org/2006/07/SWD/SKOS/skos-and-owl/master.html. Accessed 2016-04-05.
- Falco R, Gangemi A, Peroni S, Shotton D, Vitali F. Modelling OWL ontologies with graffoo In: Presutti V, Blomqvist E, Troncy R, Sack H, Papadakis I, Tordai A, editors. The Semantic Web: ESWC 2014 Satellite Events. Springer. p. 320–325. http://link.springer.com/10.1007/978-3-319-11955-7_42. Accessed 2016-04-06.
- Simple Part-whole Relations in OWL Ontologies. https://www.w3.org/2001/sw/BestPractices/OEP/SimplePartWhole/. Accessed 2016-04-05.
- Schulz S, Romacker M, Hahn U. Part-whole reasoning in medical ontologies revisited–introducing SEP triplets into classification-based description logics. Proceedings of the AMIA Symposium. 1998:830–834. Accessed 2016-08-04.
- Schulz S, Hahn U. Part-whole representation and reasoning in formal biomedical ontologies. Artif Intell Med. 2005; 34(3):179–200. doi:10.1016/j.artmed.2004.11.005. Accessed 2016-08-02View ArticleGoogle Scholar
- ICD Conversion Programs - SEER. http://seer.cancer.gov/tools/conversion/. Accessed 2016-02-16.
- Héja G, Surján G, Varga P. Ontological analysis of SNOMED CT; 8:8. doi:10.1186/1472-6947-8-S1-S8. Accessed 2015-07-01
- Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption in SNOMED CT: An exploration into large description logic-based biomedical terminologies. 39(3):183. doi:10.1016/j.artmed.2006.12.003. Accessed 2015-07-02
- Rescher N. Axioms for the part relation. Philos Stud. 1955; 6(1):8–11. doi:10.1007/BF02341057. Accessed 2016-08-02View ArticleGoogle Scholar
- Rosse C, Mejino Jr JLV. Anatomy Ontologies for Bioinformatics. Computational Biology In: Burger A, Davidson D, Baldock R, editors. London: Springer: 2008. p. 59–117. doi:10.1007/978-1-84628-885-2_4. http://link.springer.com/chapter/10.1007/978-1-84628-885-2_4/. Accessed 2016-08-05.
- Foundational Model of Anatomy | Structural Informatics Group. http://si.washington.edu/projects/fma. Accessed 2016-08-05.