Ontology construction
Best practices in ontology engineering recommend reusing existing content and creating modular ontologies [46]. These recommendations are implemented by reusing concepts from different ontologies, so the resulting ontology infrastructure is likely to be a networked ontology. The OBO Foundry has also developed a series of principles for ontology construction that address modularity, orthogonality and reusability [47].
The method for constructing the domain ontology used in this work consisted of identifying the main entities that should be represented, searching BioPortal for existing ontologies containing classes representing these entities, selecting the most appropriate ones (by our subjective criteria), and extending them when necessary. The final ontology has been implemented using Protégé in OWL-DL, which is the OWL subset based on Description Logics.
Data generation and representation
In this work, we have generated a simulated cancer registry dataset using the statistical distribution of a real registry dataset, following the method proposed in [48]. Data provided by the National Cancer Registry of Ireland were used to obtain a patient distribution by age. The cancer registry was accessed on 10-05-2016 and we included 533409 cases diagnosed from 1994 to 2013. The patients were generated in groups classified by gender and 5-year age ranges (0-4, 5-9, 10-14, etc.). The last group of patients contains people older than 85 years.
For each group of patients we have calculated the probability distribution of being diagnosed with a specific type of cancer, and the probability distribution of receiving a particular therapy (surgery, chemotherapy, radiotherapy, hormonal therapy, etc.) for a specific diagnosis. These probabilities were used to assign weights to every type of cancer and its therapies for each group of patients. For example, for patients between 60 and 64 years old, the probabilities for different types of cancer are breast cancer (0.23), lung cancer (0.17), prostate cancer (0.17), and colorectal cancer (0.08). For patients within this age range who are diagnosed with colorectal cancer, the probabilities of the therapies would then be: teletherapy (0.44), chemotherapy (0.44) and surgical treatment (0.12). Figure 1 shows the stack of distributions. When the random number is between 0.57 and 0.64 we assign colorectal cancer as the patient’s diagnosis. Then, we generate a new random number to assign the first therapy and so on.
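The stacked-distribution lookup described above can be sketched as follows. This is a minimal illustration rather than the generator used in this work: the function names are hypothetical, and the "other" entry is an assumption added only so that the quoted weights for the 60-64 age group sum to 1.

```python
import bisect
import itertools
import random

# Weights quoted in the text for patients aged 60-64. The "other" entry is an
# assumption that absorbs the remaining probability mass.
CANCER_WEIGHTS = {
    "breast": 0.23,
    "lung": 0.17,
    "prostate": 0.17,
    "colorectal": 0.08,
    "other": 0.35,
}
THERAPY_WEIGHTS = {
    "colorectal": {
        "teletherapy": 0.44,
        "chemotherapy": 0.44,
        "surgical treatment": 0.12,
    },
}


def pick(weights, u):
    """Return the label whose cumulative interval contains u (0 <= u < 1)."""
    labels = list(weights)
    cumulative = list(itertools.accumulate(weights.values()))
    index = bisect.bisect_right(cumulative, u)
    # Guard against rounding in the last cumulative value.
    return labels[min(index, len(labels) - 1)]


def simulate_case(rng):
    """Draw a diagnosis, then a first therapy if weights are known for it."""
    diagnosis = pick(CANCER_WEIGHTS, rng.random())
    therapies = THERAPY_WEIGHTS.get(diagnosis)
    first_therapy = pick(therapies, rng.random()) if therapies else None
    return diagnosis, first_therapy
```

With these weights, a random number of 0.60 falls in the colorectal interval, so `pick(CANCER_WEIGHTS, 0.60)` selects colorectal cancer; a second draw then selects the first therapy.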
Furthermore, survival and mortality data were used for extracting the evolution of the disease. Finally, we ensured that the number of patients with more than one cancer diagnosis matches the distribution of the real dataset.
Our simulated dataset consists of randomised cases. For each case, we establish the gender and age of the patient. Then, we apply a partially random distribution algorithm to obtain the patient characteristics. This algorithm uses the weights assigned to each type of cancer, therapy or course to generate distributions similar to the original database. It is able to generate patients with one or more diagnoses, each with various therapies and courses, following the probability distributions previously calculated.
Such data have been represented in RDF by applying SWIT, whose transformation method has three main steps: (1) definition of the mapping rules between the database schema and the ontology; (2) generation of the RDF data; and (3) importing the RDF data into the semantic data store. We use a semantic repository to store the data, which integrates two types of data sources: (1) an OWL file server with the formal representation of the domain, and (2) an RDF repository which stores the data. Virtuoso is used as the data store [49].
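To make steps (1) and (2) concrete, the sketch below applies a mapping rule to one database row and emits N-Triples lines. It does not reproduce SWIT's actual rule format, which is not described here: the namespace, the mapping dictionary and the property names are all hypothetical.

```python
# Hypothetical namespace and mapping rule; SWIT's real rule syntax differs.
BASE = "http://example.org/registry/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

MAPPING = {
    "class": "Patient",                      # target ontology class
    "id_column": "patient_id",               # column used to mint the URI
    "data_properties": {"gender": "hasGender", "age": "hasAge"},
}


def row_to_ntriples(row, mapping=MAPPING, base=BASE):
    """Translate one table row into a list of N-Triples statements."""
    subject = f"<{base}patient/{row[mapping['id_column']]}>"
    lines = [f"{subject} <{RDF_TYPE}> <{base}{mapping['class']}> ."]
    for column, prop in mapping["data_properties"].items():
        # Escape backslashes and quotes so the literal stays well-formed.
        value = str(row[column]).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{base}{prop}> "{value}" .')
    return lines


triples = row_to_ntriples({"patient_id": 7, "gender": "female", "age": 62})
```

The resulting lines could then be bulk-loaded into the RDF repository, which corresponds to step (3).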
Exploitation model
Our approach includes a set of methods for exploiting the information model in the semantic repository.
Ontology-driven search (ODS)
SPARQL is the language used for querying the data store. We use our ontology-guided input text subsystem [50] to make it easier for clinicians to exploit the data warehouse. The main objective is to allow users to design and execute SPARQL queries without knowing SPARQL. This tool is an editor for SPARQL queries supported by an OWL ontology, which provides the classes and properties that can be used for creating the SPARQL query that will be executed on the RDF repository. The construction of a query begins with the selection of a main class of the ontology. For example, if we wish to find patients, then the ODS begins with the selection of the ontology class Patient. The user can define filters over this class by using the data properties or object properties of the ontology. Using an owl:ObjectProperty makes it possible to include other concepts in the query. For example, if we wish to find patients whose diagnosis is lung cancer, the user can select the owl:ObjectProperty hasDiagnosis, which is associated with the class Patient; this in turn makes it possible to use the owl:ObjectProperty Pathological structure of the class Diagnosis to select the class representing lung cancer. The ODS is able to generate SPARQL queries in which the subject is an ontology class, the predicate is a property and the object can be either a value or another concept. By selecting an owl:ObjectProperty, the user can add other properties of the linked concept to the query. This service follows the approach of template-based searches [43].
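As an illustration of how such a selection could be turned into a query, the sketch below builds a SPARQL string from a main class, a chain of object properties and a target class. It is not the ODS implementation: the `ex:` namespace, the identifiers `pathologicalStructure` and `LungCancer`, and the assumption that the target is matched by `rdf:type` are all hypothetical.

```python
PREFIX = "PREFIX ex: <http://example.org/onto#>"  # hypothetical namespace


def build_query(main_class, path, target_class):
    """Build a SELECT query that follows `path` (a list of object properties)
    from instances of `main_class` to an instance of `target_class`."""
    patterns = [f"?s0 a ex:{main_class} ."]
    for step, prop in enumerate(path):
        patterns.append(f"?s{step} ex:{prop} ?s{step + 1} .")
    patterns.append(f"?s{len(path)} a ex:{target_class} .")
    body = "\n  ".join(patterns)
    return f"{PREFIX}\nSELECT DISTINCT ?s0 WHERE {{\n  {body}\n}}"


# Patients whose diagnosis has lung cancer as its pathological structure.
query = build_query(
    "Patient", ["hasDiagnosis", "pathologicalStructure"], "LungCancer"
)
print(query)
```

Each selected object property adds one triple pattern and one new variable, which mirrors the way the user extends the query concept by concept.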
With this tool, the data store can be searched using the properties defined in the ontology. Moreover, it allows the generation of aggregated queries for the elaboration of representative charts of the data store. The generated queries can be stored for parameterisation and reuse. Aggregate functions such as count, average, min or max can be used.
The results of these queries can be linked with other resources. The filters used can also be stored for later reuse. The semantic search engine not only allows for data retrieval but also for creating new classes in the semantic model, which can be regarded as OWL defined classes. For example, the query for patients with colon cancer could define the class “Patient with colon cancer”. The members of this class are obtained by executing the corresponding query.
Semantic profiles
Conceptually speaking, the semantic profile is defined as the set of relations and properties of an individual. Semantic profiles make it possible to identify groups of patients that share the same properties and are therefore useful for comparing and studying such groups. Ontologies are of special interest for creating profiles because they allow individuals to be selected and aggregated from a conceptual perspective. Our approach can also generate the semantic profile of a group of patients by applying one or more criteria.
Hence, we define a semantic profile as the subset of semantic information of an individual that is interesting for a particular analysis. The profile of the individual i is calculated as shown in Eq. 1.
$$ SP(i) = S(d) \cup S(SP(o)) $$
(1)
where S(d) represents a subset of the selected owl:DatatypeProperty values of i and S(SP(o)) represents a function that retrieves the profiles of the individuals o linked to i through owl:ObjectProperty axioms. The semantic profile is built by applying the ODS to the entities defined in a domain ontology. The ODS makes it possible to select the properties of interest and to define the filtering and aggregation conditions. The user can define the SPARQL queries that will return the subset of properties and relationships that provides the best description of the individual for the specific case. This information is obtained for each individual, and the results can be viewed as a cache of the most important semantic information describing the individuals.
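The recursive structure of Eq. 1 can be illustrated on a toy in-memory graph. This is only a sketch: the individuals, property names and dictionaries below are hypothetical stand-ins for the RDF repository, and the `visited` set is an assumption added so that the recursion terminates when the data contain cycles.

```python
# Toy data: datatype values and object-property links per individual.
DATA = {
    "patient1": {"age": 62, "gender": "female"},
    "diagnosis1": {"icd10": "C50", "date": "2010-03"},
    "therapy1": {"type": "chemotherapy"},
}
LINKS = {
    "patient1": {"hasDiagnosis": ["diagnosis1"]},
    "diagnosis1": {"hasTherapy": ["therapy1"], "ofPatient": ["patient1"]},
}


def semantic_profile(individual, data_props, object_props, visited=None):
    """SP(i): selected datatype values of i plus the profiles of linked o."""
    visited = set() if visited is None else visited
    if individual in visited:          # already expanded: stop the recursion
        return {}
    visited.add(individual)
    # S(d): keep only the datatype properties selected for this analysis.
    profile = {
        (individual, prop): value
        for prop, value in DATA.get(individual, {}).items()
        if prop in data_props
    }
    # S(SP(o)): merge the profiles of individuals reached via object properties.
    for prop, targets in LINKS.get(individual, {}).items():
        if prop in object_props:
            for target in targets:
                profile.update(
                    semantic_profile(target, data_props, object_props, visited)
                )
    return profile


profile = semantic_profile(
    "patient1",
    data_props={"age", "icd10", "type"},
    object_props={"hasDiagnosis", "hasTherapy", "ofPatient"},
)
```

Choosing different `data_props` and `object_props` yields a different profile for the same individual, which corresponds to selecting the properties of interest for a particular analysis.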
Semantic profiles can be seen as a purpose-specific application of the semantic search engine. Two types of semantic profiles are of special relevance in the context of this work, namely, the timeline representation of a patient and the aggregated disease timeline representation of a patient group with some common properties. Both are described in the next sections.
Disease timeline of a cancer patient
The disease timeline of a patient contains information about various health-related events (e.g. diagnosis, patient conditions, therapies and the disease courses). Retrieving these events for a patient requires data normalisation for the representation of therapies by month. Figure 2 shows that every diagnosis has an associated timeline which includes therapies and the disease course, both ordered by month. For example, we can show the timeline for a breast cancer patient that includes the applied therapies (surgical treatment, chemotherapy, etc.) for every period. Furthermore, we can show the course of the disease and its relation with changes in therapies. It also includes the date of the diagnosis and the date of the last encounter. Finally, the profile contains all the patient’s diagnoses and a list of her conditions.
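The normalisation by month mentioned above can be sketched as follows, assuming that each event carries a start and an end date and that months are counted from the date of diagnosis. The tuple layout and function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_offset(start, when):
    """Number of calendar months between two dates (0 = same month)."""
    return (when.year - start.year) * 12 + (when.month - start.month)


def build_timeline(diagnosis_date, events):
    """events: iterable of (kind, label, first_day, last_day), where kind is
    either "therapy" or "course". Returns {month: {"therapy": [...],
    "course": [...]}} ordered by month since diagnosis."""
    timeline = defaultdict(lambda: {"therapy": [], "course": []})
    for kind, label, first_day, last_day in events:
        first = month_offset(diagnosis_date, first_day)
        last = month_offset(diagnosis_date, last_day)
        for month in range(first, last + 1):
            timeline[month][kind].append(label)
    return dict(sorted(timeline.items()))


events = [
    ("therapy", "surgical treatment", date(2010, 3, 20), date(2010, 3, 20)),
    ("therapy", "chemotherapy", date(2010, 4, 1), date(2010, 6, 15)),
    ("course", "remission", date(2010, 6, 1), date(2010, 6, 30)),
]
timeline = build_timeline(date(2010, 3, 5), events)
```

In this example the surgery falls in month 0, chemotherapy spans months 1 to 3, and the remission recorded in month 3 can be read alongside the therapy active in that month.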
Aggregated disease timeline of a group of patients
The aggregated timeline of a patient group (see Fig. 3) includes all the events of the selected patients who meet the same selection criteria for a given period and for a specific diagnosis. The groups of patients are defined using the ODS, which makes it possible to define groups of patients with the same diagnosis, staging, grading and age range. The semantic profile of each member of the group is then obtained, and the profiles are analysed together to produce a matrix that contains the disease courses of the included patients for every month of the disease. Using this method, the user is able to generate, for example, a group of patients with lung cancer aged between 60 and 70 years. In this case, our service could represent which therapies are applied in chronological order and which are the most likely courses. At the same time, these graphical representations can be used as new filters to recalculate the corresponding variables. For example, if the user selects chemotherapy as the first therapy, the representation changes to reflect the new scenario.
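A minimal sketch of the aggregation step is given below, assuming that each patient timeline has already been reduced to one disease course label per month. The input layout and the course labels are hypothetical.

```python
from collections import Counter


def aggregate_courses(timelines):
    """timelines: list of {month: course_label} dicts, one per patient.
    Returns (months, courses, matrix) where matrix[r][c] is the number of
    patients with course courses[r] in month months[c]."""
    months = sorted({m for t in timelines for m in t})
    courses = sorted({c for t in timelines for c in t.values()})
    counts = {m: Counter(t[m] for t in timelines if m in t) for m in months}
    matrix = [[counts[m][c] for m in months] for c in courses]
    return months, courses, matrix


group = [
    {0: "stable", 1: "progression"},
    {0: "stable", 1: "remission"},
    {0: "progression"},
]
months, courses, matrix = aggregate_courses(group)
```

Re-running the aggregation on a filtered subset of `group`, for instance only patients whose first therapy was chemotherapy, would correspond to using the representation as a new filter.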
Enrichment analysis
Enrichment analysis is a type of statistical analysis that is frequently used in biomedical domains [51]. Our enrichment analysis method is based on the hypergeometric distribution method established for GO::TermFinder to determine the significance of a Gene Ontology annotation to a list of genes [52], and the hypergeometric distribution was implemented using Apache Commons Math.
This type of analysis is useful to compare several subsets of patients with the same diagnosis. We perform a statistical analysis of the ICD-10 codes to support the users in the definition of diagnosis-based groups. We calculate the P-value for each group as shown in Eq. 2.
$$ P = 1 - \sum_{i=0}^{k-1}\frac{\binom{M}{i}\binom{N - M}{n - i}}{\binom{N}{n}} $$
(2)
where N is the total number of ICD-10 codes used in the cancer registry, M is the number of diagnoses annotated with each ICD-10 code, n is the number of ICD-10 codes of interest for a concrete patient group and k is the number of ICD-10 codes used for annotating each diagnosis.
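For illustration, the upper-tail hypergeometric probability P(X ≥ k), in which each term of the sum is divided by the number of ways of choosing n items out of N, can be computed directly with the Python standard library. This is a sketch only; the analysis in this work relies on Apache Commons Math, and the function name below is hypothetical.

```python
from math import comb


def enrichment_p_value(N, M, n, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: population size, M: number of "successes" in the population,
    n: sample size, k: observed number of successes in the sample.
    """
    total = comb(N, n)
    # Sum P(X = i) for i = 0 .. k-1; i cannot exceed the sample size n.
    lower_tail = sum(
        comb(M, i) * comb(N - M, n - i) for i in range(min(k, n + 1))
    )
    return 1 - lower_tail / total


# Small worked case: N = 10, M = 4, n = 3.
# P(X >= 1) = 1 - C(4,0) * C(6,3) / C(10,3) = 1 - 20/120
p = enrichment_p_value(10, 4, 3, 1)
```

A small P-value indicates that the observed count k is unlikely under random sampling, which is the criterion used to flag a group as enriched.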
Semantic dashboard
A semantic dashboard is a graphical representation of the results of one or more queries. Semantic dashboards are represented as 〈〈L, V〉, isDashboard, U〉, where 〈L, V〉 are the results of the SPARQL query as key-value pairs and U is the user who defined the dashboard. Each user can define and customise her dashboards.
The semantic dashboard is implemented using the ODS and makes it possible to create aggregated data. The results can be represented graphically and in tabular format. Based on the persistence model of SPARQL queries, the representations can be used for accessing the data instances contained in each representation. Consequently, aggregation control boxes can be regarded as search filters of the semantic search engine.
Figure 4 shows the query generated with the ODS for searching patients over 70 years old, classified by cancer type. The graphical representation is shown on the left side and the data in tabular format on the right side.
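The 〈〈L, V〉, isDashboard, U〉 representation can be mirrored by a small data structure holding the key-value pairs and the owning user. The sketch below aggregates toy query results for patients over 70 by cancer type; the class, the function and the sample rows are hypothetical, and the aggregation is done in memory here rather than by a SPARQL aggregate query.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Dashboard:
    pairs: tuple   # the <L, V> key-value pairs returned by the query
    owner: str     # U, the user who defined the dashboard


def patients_over_70_by_cancer(rows, owner):
    """rows: iterable of (age, cancer_type) tuples, one per patient."""
    counts = Counter(cancer for age, cancer in rows if age > 70)
    return Dashboard(pairs=tuple(sorted(counts.items())), owner=owner)


rows = [(72, "lung"), (65, "breast"), (81, "lung"), (75, "prostate")]
dashboard = patients_over_70_by_cancer(rows, owner="clinician1")
```

The same `pairs` can feed both the chart and the table, since both views display the same labels and values.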
The semantic dashboards can also include multiple aggregated queries and display comparative graphics. Finally, dashboards can also be persisted, parameterised by users and reused.
Recommendation
We have developed an algorithm based on Bayesian networks to suggest the most appropriate treatment for a patient. This algorithm is based on the generation of probabilistic models using semantic node profiles. Bayesian networks cannot have cycles [53], but our semantic dataset might contain cycles. The semantic profiles might have cycles due to, e.g., the repeated application of a given treatment to the patient. To solve this problem, a tree network is generated for each profile.
To determine which treatment is likely to be the most appropriate for a patient given a number of features, the model first retrieves all the patients with such features, and then uses their semantic profiles to generate the map of Bayesian networks with the possible treatments by period (month, term, etc.). Once a treatment is selected, the network is re-calculated to improve the next recommendation. Given this dynamic aspect of the network, the method requires the user to indicate which characteristics might generate a cycle in the network, to prevent the algorithm from entering an infinite loop.
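The idea of unrolling repeated treatments into a tree can be illustrated with a much simpler model than a full Bayesian network. In the sketch below, each node is identified by the whole treatment history up to a period, so a treatment applied twice produces two distinct nodes and no cycle arises. The recommendation is simply the most frequent next treatment among similar patients; this conditional-frequency estimate is an assumption made for illustration and does not reproduce the algorithm developed in this work.

```python
from collections import Counter, defaultdict


def build_prefix_counts(sequences):
    """Map each treatment history (a tuple) to a Counter of next treatments.
    sequences: one list of treatments per patient, ordered by period."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for period, treatment in enumerate(sequence):
            counts[tuple(sequence[:period])][treatment] += 1
    return counts


def recommend(counts, history):
    """Return (most frequent next treatment, estimated probabilities)."""
    options = counts.get(tuple(history))
    if not options:
        return None, {}
    total = sum(options.values())
    probabilities = {t: c / total for t, c in options.items()}
    return max(probabilities, key=probabilities.get), probabilities


sequences = [
    ["surgery", "chemotherapy", "chemotherapy"],
    ["surgery", "chemotherapy", "radiotherapy"],
    ["surgery", "radiotherapy"],
    ["chemotherapy", "surgery"],
]
counts = build_prefix_counts(sequences)
first, first_probs = recommend(counts, [])
second, second_probs = recommend(counts, ["surgery"])
```

Extending `history` with the treatment actually chosen and calling `recommend` again plays the role of re-calculating the model after each selection.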