In this section we review the main foundations of the work presented in this paper: HL7 and the HL7 RIM, the R2RML language, and morph-RDB.
HL7 RIM
Recent years have witnessed a huge increase in the number of biomedical databases [19]. This increased availability opens up new opportunities while posing important new challenges, especially with respect to their integration, which is crucial to obtain a proportional increase in knowledge in the biomedical area. In this context, it is common to establish a CDM for the representation of biomedical data, which allows multiple established terminologies to be exploited to build a core concept dataset as the common medical vocabulary of the platform.
Among the many Detailed Clinical Models that have been reviewed for the integration of biomedical datasets [20], HL7 v3 is one of the most relevant, since the main requirement for the CDM is that any data coming from clinical institutions can be represented without loss of information. The HL7 RIM offers wide coverage for representing clinical data and has proven useful for clinical information exchange. The HL7 v3 standard defines the RIM at its core. This definition consists of a UML class diagram; it does not define a data structure or a database model. In addition, issues such as the management of data types are not trivially translatable into a database model. As a consequence, we previously defined a relational model for it, which is shown in Fig. 1 and described in [13].
The HL7 RIM backbone contains three main classes, Act, Role and Entity, which are linked together by three association classes (ActRelationship, Participation and RoleLink). The core of the HL7 RIM is the Act class. An Act is defined as “a record of an event that has happened or may happen”. Any healthcare situation and all information concerning it should be describable using the RIM by including the type of act (what happens), the actor who performs the deed, and the objects or subjects (Entity) that the act affects (Role). Additional information may be provided to indicate location (where), time (when) and manner (how), together with reasons (why) or motives (what for). The Act and Entity classes have specializations that add attributes, such as Observation (a subclass of Act) or Person (a subclass of Entity).
This standard is able to represent almost any healthcare situation and a wide variety of information associated with it [21]. Based on this idea, we have defined a subset of the HL7 RIM schema in which we implement the classes and attributes that are necessary to represent the scenario of sharing breast cancer clinical trial data:
- Act, with the subclasses Observation, Procedure, SubstanceAdministration, and Exposure.
- Role.
- Entity, with the subclasses LivingSubject, Person, and Device.
- The classes (i) ActProcedureApproachSiteCode, (ii) ActMethodCode, (iii) ActTargetSiteCode, (iv) ActObservationInterpretationCode, and (v) ActObservationValues, which are related to Act.
Attribute data types are rather complex in the RIM, so we adapted them to this scenario, following the HL7 datatype specifications [22]. As a result, some attributes are simplified in the relational model compared to those defined by the HL7 v3 standard. To improve the performance and understandability of the HL7 RIM schema, we defined a set of views. These views cover the data retrieval requirements of the clinical scenario. We defined one view for each clinical context (Observation, Procedure, SubstanceAdministration, and Exposure).
Therefore, the HL7 RIM-based CDM defined above fulfills the requirements of the breast cancer clinical trials scenario. Furthermore, we have created an ontology that reflects the HL7 RIM model [23], which is available for others to reuse.
Figure 2 depicts a simplified schema of the implemented database following the HL7 v3 RIM definition. However, relationships between Entity and Role instances are typically one-to-one. Moreover, the Act table is the backbone, but data is classified as one of its descendants (Observation, Procedure, SubstanceAdministration, Exposure, etc.). Thus, the logical schema for querying an Act descendant (e.g. Observation) from our database looks like the schema represented in Fig. 3.
Therefore, every Act subclass in the HL7 v3 RIM data schema can be represented as a star diagram, as typically used in data warehouse definitions. Our database can be visualized as a snowflake diagram similar to the i2b2 star model [6]. Each event record is an instance of a subclass of Act (similar to the i2b2 fact table). Entities and Roles (patient, location, care provider, etc.) are lookup tables called Dimensions.
In contrast to other works in the literature that use query translation [8], and since the Act tables contain the largest amount of data in the model, we have adopted the approach of dividing complex queries into atomic queries. Consequently, in order to efficiently execute queries involving several instances of acts and relationships (e.g. temporal dependencies), these queries are divided and their results are later combined using set operators [13].
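The decomposition strategy can be illustrated with a minimal sketch. It assumes two hypothetical, heavily simplified per-context views (v_observation and v_procedure) with made-up columns and codes; the actual schema and the way queries are split are described in [13]. Each atomic query touches a single Act descendant, and the partial results are combined with an SQL set operator.

```python
import sqlite3

# Hypothetical, simplified views of two Act descendants (not the real schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v_observation (actId INTEGER PRIMARY KEY, patientId INTEGER, code TEXT);
    CREATE TABLE v_procedure   (actId INTEGER PRIMARY KEY, patientId INTEGER, code TEXT);
    INSERT INTO v_observation VALUES
        (1, 10, 'tumour-size'), (2, 11, 'tumour-size'), (3, 12, 'her2-status');
    INSERT INTO v_procedure VALUES
        (4, 10, 'mastectomy'), (5, 12, 'mastectomy');
""")

# Atomic queries: each one filters a single Act descendant.
atomic_observation = "SELECT patientId FROM v_observation WHERE code = 'tumour-size'"
atomic_procedure = "SELECT patientId FROM v_procedure WHERE code = 'mastectomy'"


def combine(operator, *atomic_queries):
    """Join atomic queries with a set operator (UNION, INTERSECT or EXCEPT)."""
    return f" {operator} ".join(atomic_queries)


# Patients that have both a tumour-size observation and a mastectomy procedure.
combined_sql = combine("INTERSECT", atomic_observation, atomic_procedure)
patients = sorted(row[0] for row in conn.execute(combined_sql))
print(patients)
```

In this sketch each atomic query can be answered from one view only, so the database never has to join the large Act tables with each other; whether this pays off in practice depends on the data and the indexes available.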
R2RML
R2RML [18] is a W3C recommendation that defines a mapping language from relational databases to RDF. An R2RML mapping document consists of a set of triples maps (rr:TriplesMap), which specify the rules to generate RDF triples from database rows/values. A triples map consists of:
- A logical table (rr:LogicalTable), which is either a base table or an SQL view and provides the rows to be mapped to RDF triples.
- A subject map (rr:SubjectMap), which specifies the rules to generate the subject component of the RDF triples.
- A set of predicate object maps (rr:PredicateObjectMap), each composed of a set of predicate maps (rr:PredicateMap) and object maps (rr:ObjectMap), which generate the predicate and object components of the RDF triples, respectively. If a join with another triples map is needed, a reference object map (rr:RefObjectMap) can be used. The other triples map to be joined is specified in rr:parentTriplesMap and the join condition is specified via rr:Join.
Figure 4 illustrates an overview of an R2RML TriplesMap class.
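To make the structure of a triples map more concrete, the following minimal sketch models two triples maps as plain Python dictionaries and materializes their triples. Everything in it is an assumption made for illustration: the dictionary keys, the example.org IRIs, the in-memory rows standing in for logical tables, and the materialize helper are hypothetical and are neither R2RML syntax nor part of any R2RML processor. The second predicate object map plays the role of a reference object map, with a parent triples map and a join condition.

```python
EX = "http://example.org/"  # hypothetical base IRI

# Hypothetical in-memory rows standing in for two logical tables.
OBSERVATION_ROWS = [{"id": 7, "title": "Tumour size"}]
PATIENT_ROWS = [{"id": 1, "name": "Ann", "obs_id": 7},
                {"id": 2, "name": "Bob", "obs_id": None}]

observation_map = {
    "logical_table": OBSERVATION_ROWS,
    "subject_template": EX + "observation/{id}",
    "predicate_object_maps": [
        {"predicate": EX + "hasTitle", "object_column": "title"},
    ],
}

patient_map = {
    "logical_table": PATIENT_ROWS,
    "subject_template": EX + "patient/{id}",
    "predicate_object_maps": [
        {"predicate": EX + "hasName", "object_column": "name"},
        # Plays the role of a reference object map: join with a parent map.
        {"predicate": EX + "hasObservation",
         "parent_triples_map": observation_map,
         "join": {"child": "obs_id", "parent": "id"}},
    ],
}


def materialize(triples_map):
    """Generate (subject, predicate, object) tuples for one triples map."""
    triples = []
    for row in triples_map["logical_table"]:
        subject = triples_map["subject_template"].format(**row)
        for pom in triples_map["predicate_object_maps"]:
            if "parent_triples_map" in pom:
                parent, join = pom["parent_triples_map"], pom["join"]
                child_value = row[join["child"]]
                for parent_row in parent["logical_table"]:
                    # A NULL never satisfies the join condition, as in SQL.
                    if child_value is not None and child_value == parent_row[join["parent"]]:
                        obj = parent["subject_template"].format(**parent_row)
                        triples.append((subject, pom["predicate"], obj))
            else:
                value = row[pom["object_column"]]
                if value is not None:  # NULL values generate no triple
                    triples.append((subject, pom["predicate"], str(value)))
    return triples


patient_triples = materialize(patient_map)
for triple in patient_triples:
    print(triple)
```

The object of the joined triple is the subject generated by the parent triples map, which is why the sketch reuses the parent's subject template.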
Subject maps, predicate maps, and object maps are term maps, which specify the rules to generate the corresponding element of an RDF triple. Those rules can be specified as a constant (rr:constant), a database column (rr:column), or a template (rr:template). Figure 5 illustrates an overview of an R2RML TermMap class.
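The three kinds of term map can be sketched as a single function that produces a term from one database row. This is only an illustration under stated assumptions: the dictionary representation, the generate_term name and the example.org IRIs are hypothetical, escaped braces in templates are ignored, and urllib.parse.quote is used as an approximation of the IRI-safe encoding that R2RML prescribes for templates.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")  # matches {columnName} in a template


def generate_term(term_map, row):
    """Return the term produced by a term map for one row, or None for NULLs."""
    if "constant" in term_map:
        # Constant-valued term map: the same term for every row.
        return term_map["constant"]
    if "column" in term_map:
        # Column-valued term map: the value of the referenced column.
        value = row[term_map["column"]]
        return None if value is None else str(value)
    if "template" in term_map:
        # Template-valued term map: substitute each {column} placeholder.
        columns = PLACEHOLDER.findall(term_map["template"])
        if any(row[column] is None for column in columns):
            return None
        return PLACEHOLDER.sub(
            lambda match: quote(str(row[match.group(1)]), safe=""),
            term_map["template"],
        )
    raise ValueError("term map needs a constant, a column or a template")


row = {"patientId": 42, "patientName": "Ann Lee"}
print(generate_term({"constant": "http://example.org/hasName"}, row))
print(generate_term({"column": "patientName"}, row))
print(generate_term({"template": "http://example.org/patient/{patientId}"}, row))
```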
morph-RDB
morph-RDB is part of the morph suite [24]. It receives as input the connection details of a relational database, an R2RML mapping document and a SPARQL query. It translates the SPARQL query into an SQL query over the underlying relational database and translates the results back into a format appropriate for the SPARQL query. The query translator component in morph-RDB implements the algorithm described in [17], which extends previous work [25] that defined a set of mappings and functions to translate SPARQL queries posed against RDB-backed triple stores into SQL queries and proved the correctness of the query translation using the notion of semantics preservation. In other words, the SPARQL query realized as an SQL query returns the same answers as the same SPARQL query executed over an R2RML materialization. We extend their work by relating those mappings and functions to the R2RML mapping elements.
For an in-depth explanation of the query rewriting algorithm, we recommend the aforementioned references. As a quick summary, we use the following mappings and functions:
- α mapping, which, given a triple pattern tp and an R2RML mapping document m, returns the logical tables associated with the pattern.
- β mapping, which, given a triple pattern tp and an R2RML mapping document m, returns the columns associated with a component of the triple pattern (subject, predicate, or object).
- name function, which generates a unique alias for the projected attributes.
- genPRSQL function, which, given a triple pattern tp, the β and name functions, and an R2RML mapping document m, generates an SQL expression that projects only the attributes returned by the β mapping and renames them using the name function.
- genCondSQL function, which, given a triple pattern tp and an R2RML mapping document m, generates an SQL expression that filters the logical tables returned by α to match the triple pattern tp (keeping only non-null values for the columns returned by the β mapping, using “IS NOT NULL” expressions).
- trans function, which, given a SPARQL graph pattern (triple pattern, AND, OPT, UNION, FILTER, SELECT) and an R2RML mapping document m, generates an SQL query that, when evaluated, returns the result of the corresponding SPARQL pattern.
The details of the definitions and algorithms defined for the above mappings and functions are provided in [17].
Example 1
Consider the table v_person(patientId, patientName, gender, actId), which stores information about patients. This table is mapped to the class Patient, with the attribute patientId (together with the base URI for the class Patient) as the identifier of the instances. The attributes patientId and patientName are mapped to the ontology properties hasID and hasName, respectively. Now let us add another table, v_observation(actId, title, code), which describes observations. This table is mapped to the class Observation, with actId as the identifier of the instances and the attribute title mapped to the property hasTitle. patientId and actId are the primary keys of the tables v_person and v_observation, respectively. Furthermore, the column actId of table v_person is a foreign key that refers to the column actId of table v_observation, and this relation is mapped to the property hasObservation. The instances of the tables can be seen in Fig. 6.
Consider the following triple pattern tp = (?p :hasPatientName ?pName).
- α(tp) = v_person.
- β(tp.subject) = v_person.patientId, β(tp.predicate) = ':hasPatientName', β(tp.object) = v_person.patientName.
- name(?p) = var_p, name(:hasPatientName) = iri_hasPatientName, name(?pName) = var_pName.
- genPRSQL(tp) = v_person.patientId AS var_p, ':hasPatientName' AS iri_hasPatientName, v_person.patientName AS var_pName
- genCondSQL(tp) = v_person.patientId IS NOT NULL AND v_person.patientName IS NOT NULL
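The values listed above can be reproduced with a small sketch. It is not the morph-RDB implementation: the MAPPING dictionary is a hypothetical stand-in for the R2RML mapping document, the function names are our own, only triple patterns made of a variable subject, a constant predicate and a variable object are handled, and the sample rows are invented. The final SELECT shows one plausible way of assembling the pieces for a single triple pattern; the actual trans function is defined in [17].

```python
import sqlite3

# Hypothetical stand-in for the R2RML mapping of the example.
MAPPING = {
    ":hasPatientName": {"table": "v_person",
                        "subject": "patientId",
                        "object": "patientName"},
}


def alpha(tp):
    """Logical table associated with the triple pattern."""
    return MAPPING[tp[1]]["table"]


def beta(tp, position):
    """Column (or constant) associated with one component of the pattern."""
    entry = MAPPING[tp[1]]
    if position == "predicate":
        return f"'{tp[1]}'"
    return f"{entry['table']}.{entry[position]}"


def name(term):
    """Alias for a projected attribute."""
    return "var_" + term[1:] if term.startswith("?") else "iri_" + term.lstrip(":")


def gen_pr_sql(tp):
    positions = ("subject", "predicate", "object")
    return ", ".join(f"{beta(tp, pos)} AS {name(term)}" for pos, term in zip(positions, tp))


def gen_cond_sql(tp):
    # Only variables are handled here; constants would need equality tests.
    return " AND ".join(f"{beta(tp, pos)} IS NOT NULL" for pos in ("subject", "object"))


def trans_tp(tp):
    """Assumed assembly of the pieces for a single triple pattern."""
    return f"SELECT {gen_pr_sql(tp)} FROM {alpha(tp)} WHERE {gen_cond_sql(tp)}"


tp = ("?p", ":hasPatientName", "?pName")
sql = trans_tp(tp)
print(sql)

# Invented sample data to run the generated query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v_person (patientId INTEGER PRIMARY KEY, patientName TEXT, gender TEXT, actId INTEGER)")
conn.executemany("INSERT INTO v_person VALUES (?, ?, ?, ?)",
                 [(1, "Ann", "F", 7), (2, None, "M", 8)])
rows = conn.execute(sql).fetchall()
print(rows)
```

The row whose patientName is NULL is filtered out by the condition generated by gen_cond_sql, mirroring the fact that a NULL value produces no triple.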
Finally, the results returned by the SQL queries obtained in the previous step are the values stored in the database server, not the RDF terms expected from the evaluation of a SPARQL query. This is necessary so that database servers can exploit indexes over database values that have not been transformed into other values. For example, the values for subjects may come from primary key columns. Thus, upon receiving database results that correspond to R2RML template mappings, morph-RDB translates the results according to those mappings.
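This last step can be sketched as a post-processing pass over the SQL result set. The sketch below is an assumption-laden illustration rather than morph-RDB code: the to_rdf_terms helper, the example.org template and the sample row are hypothetical, and a Python format string stands in for an R2RML template.

```python
def to_rdf_terms(sql_rows, templates):
    """Turn raw database values into RDF terms after the SQL query has run.

    sql_rows:  list of dicts, one per result row, keyed by column alias.
    templates: alias -> format string for columns mapped through a template.
    """
    translated = []
    for row in sql_rows:
        translated.append({
            alias: templates[alias].format(value) if alias in templates else value
            for alias, value in row.items()
        })
    return translated


# Raw values as returned by the database (primary key, plain name).
sql_rows = [{"var_p": 1, "var_pName": "Ann"}]
# Hypothetical template attached to the subject of the Patient mapping.
templates = {"var_p": "http://example.org/patient/{}"}

bindings = to_rdf_terms(sql_rows, templates)
print(bindings)
```

Because the IRI is built only after the query has been evaluated, the SQL query itself compares and joins on the untransformed key values.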