In this section, we propose an automated approach for extracting task-specific schema from RDF data sources in order to enable the efficient formulation of SPARQL data selection and integration queries without direct access to the data. First, we describe the basic requirements for the extracted schema as well as the fundamental idea of the schema extraction technique, before subsequently introducing a number of extensions that yield a more generally applicable schema extraction methodology. We discuss the trade-offs to be made between the different versions of the schema extraction approach and finally show how the extracted schema can be used for data selection and integration.
In the context of RDF data, the fundamental knowledge required for the creation of SPARQL queries for data selection and integration consists of the various rdf:type objects, the rdf:Property predicates and the structural relations between them. This information can itself be represented using Semantic Web Standards, such as RDFS, OWL, ShEx or SHACL.
While shape languages such as ShEx and SHACL are natural candidates for representing prescriptive data schema, they are designed specifically for validating clearly structured individual data shapes and for communicating explicit graph patterns. They are, however, less well suited for formalizing the flexible schema of an entire semi-structured dataset.
RDFS, on the other hand, provides simple and descriptive structural annotations of the relationships between properties and classes and as such is a promising candidate for the task at hand.
While OWL further extends RDFS with a powerful set of description logic-based modeling primitives, the corresponding semantic complexity adds significant overhead to the schema extraction process. Since the extracted schema is meant to be used only for query authoring and explicitly not for reasoning, in the context of this work we restrict our effort to extracting schema using RDFS and the OWL predicates owl:equivalentClass, owl:equivalentProperty and owl:sameAs, which we deem most relevant for enabling interoperability and the effective formulation of selection and integration queries.
To ensure interoperability with existing Semantic Web technologies and compatibility with standard Semantic Web tools, such as schema-introspection-assisted SPARQL query builders, the extracted schema should thus be made available as a simple RDFS and OWL vocabulary via a SPARQL endpoint.
Schema introspection refers to the process of examining the schema definition to determine which types of entities exist, which properties are defined on them and, consequently, what can be queried for. Since the schema needed to create data queries (e.g. using SPARQL) only contains basic structural information about the original data, it conveys far less privacy-critical information than exposing the actual data. As such, it can in many scenarios be published publicly without privacy concerns.
In the following, we describe an automated approach for schema extraction from RDF data which allows for the formulation of data selection and integration queries without direct access to the data and the subsequent evaluation of that query in a secure enclave.
Schema extraction
We propose an approach for schema extraction based on exploiting key characteristics of RDF, RDFS, and OWL. RDF data encoded in compliance with corresponding vocabularies inherently include metadata about their semantics and structural relationships.
For the schema extraction, the rdf:type relation plays the key role, as it declares data points to be instances of specific data types or, according to RDFS terminology and semantics [53], classes. Anything that occurs as the object of this relation is a type in this sense and thus automatically becomes part of the schema as an entity of type rdfs:Class. Additionally, any property relation (that is, any identifier occurring in the predicate position of a subject-predicate-object triple) which occurs in the data should be included as an entity of type rdf:Property. Finally, all directly describing properties of these classes and properties should be included as well. For the scope of this work, we assume that all data in the private data repository is sensitive and should remain private.
Entailment-supported schema extraction Assuming ideal conditions, namely the proper inclusion of all used vocabularies in the triple store, the correct usage of those vocabularies, as well as OWL entailment [26] support on the SPARQL endpoint providing access to the data, the entire schema of a given RDF dataset can be extracted using a single, simple SPARQL CONSTRUCT query, as depicted in Listing 1.
Note that we explicitly define the relevant subset of all available schema information to be that which is actually used in the data, i.e. the instantiated schema, and thus only extract that.
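In essence, such a query takes the following shape (a condensed sketch of the query in Listing 1, assuming OWL entailment support on the endpoint; no prefix declarations are required):

```sparql
CONSTRUCT { ?s ?p ?o }
WHERE {
  { [] ?s [] }        # instantiated properties: any IRI used in predicate position
  UNION { [] a ?s }   # instantiated classes: any IRI used as an rdf:type object
  ?s ?p ?o .          # all directly describing triples of these subjects
}
```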
The preceding query constructs an RDF graph (line 1) containing all directly describing triples ?s ?p ?o that occur in the triple store, restricted to the following subjects:

1. Instantiated RDF properties ?s (line 3), which according to RDF 1.1 Semantics [53] are any IRIs used in predicate position (cf. rdfD2).

2. Instantiated RDFS classes ?s (line 4), detected via their occurrence as the object of a triple with rdf:type as the predicate. That these are RDFS classes follows directly from the RDFS axiomatic triple rdf:type rdfs:range rdfs:Class in conjunction with the RDFS entailment pattern rdfs3 [53].
Under the SPARQL entailment regime, all subclass relationships, transitive properties, equivalences, etc. used in the data are automatically materialized (i.e. included in the dataset as inferred knowledge, as illustrated in Fig. 1) and thus resolved and included as well (cf. [53, 54]).
It should be noted that the query only extracts direct properties (i.e. triples ?s ?p ?o directly related to the subject ?s) and as such, some complex constraints such as OWL disjointness axioms are not included in the extracted schema. However, as stated before, for the task of query formulation we consider this to be sufficient.
Directly instantiated schema Since in practice few SPARQL endpoints actually support any kind of entailment and implicit triples are usually not materialized, the applicability of this basic approach is limited. While the original query can, in principle, also be executed without entailment support, there is then no guarantee that all used properties and classes are annotated accordingly as rdf:Property and rdfs:Class, and any resource ?s that lacks further describing triples ?s ?p ?o is ignored entirely.
Thus, in the following we introduce several revisions of the initial extraction query 1 that allow us to reintroduce the missing triples without relying on entailment support. Additionally, many datasets de facto employ terms from a number of different vocabularies and ontologies and deviate from the originally intended information model. Since the availability of domain and range information for the different properties employed in the dataset is especially relevant for assisting the query creation process, we further explicitly construct rdfs:domain and rdfs:range statements according to each property's respective usage in the dataset.
In scenarios where it is sufficient to consider only those types and properties that are directly used in the dataset, or where no information whatsoever about the employed vocabularies is available, it can be reasonable to disregard inferred generalizations and equivalences entirely. Listing 2 proposes a SPARQL query for the extraction of a corresponding schema, which closely reflects the structure of the underlying data and works even if the definitions of the employed ontologies are unavailable.
For this and all further queries, we assume standard SPARQL namespace and prefix definitions as specified by the World Wide Web Consortium’s OWL and SPARQL specifications [53, 55].
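In essence, the query of Listing 2 takes roughly the following shape (a condensed sketch; details may differ from the actual listing):

```sparql
CONSTRUCT {
  ?p a rdf:Property ; rdfs:domain ?sType ; rdfs:range ?oType ; ?pp ?po .
  ?c a rdfs:Class ; ?cp ?co .
}
WHERE {
  {
    ?s ?p ?o .                   # any IRI in predicate position is a used property
    OPTIONAL { ?s a ?sType }     # usage-derived domain
    OPTIONAL { ?o a ?oType }     # usage-derived range
    OPTIONAL { ?p ?pp ?po }      # further triples describing the predicate
  }
  UNION
  {
    ?i a ?c .                    # any rdf:type object is a used class
    FILTER(!isBlank(?c))         # skip class declarations without an identifier
    OPTIONAL { ?c ?cp ?co }      # further triples describing the class
  }
}
```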
Analogously to query 1, we detect predicates as any Internationalized Resource Identifier (IRI) used in predicate position (line 5) and classes as IRIs used as objects of rdf:type triples (line 9). We also include any additional information directly relating to those subjects that might be available in the dataset (lines 8 and 11). To explicitly construct rdfs:domain and rdfs:range information for the predicates, we further determine the rdf:type of each subject (line 6) and object (line 7), if available. Additionally, we filter out any class declarations without an identifier of their own, i.e. blank nodes (line 10), to avoid potential referencing issues within the extracted schema. Lastly, we construct the schema graph from all discovered predicates (explicitly typed as rdf:Property) and their related information (line 2) and all discovered classes (explicitly typed as rdfs:Class) and their related information (line 3).
When applying this extraction approach to the dataset depicted in Fig. 1, we end up with the schema depicted in Fig. 2, where classes are highlighted in blue and properties in green (i.e. their rdf:type triples are left implicit).
In this exemplary use case, closely following the extracted schema, one could subsequently query for instances of the ex:Patient class and their corresponding property ex:treatedAt. This perfectly reflects the available dataset, but without any inferred knowledge.
It should be noted that this extracted schema is explicitly not suited for triple entailment according to RDFS semantics, due to the conjunctive nature of multiple rdfs:domain and rdfs:range definitions on properties (cf. RDFS entailment patterns rdfs2 and rdfs3 [53]). A semantically correct alternative would be the usage of Schema.org's schema:domainIncludes and schema:rangeIncludes properties in line 2, instead of their RDFS counterparts. However, since RDFS domain and range semantics are implemented in a variety of tools for schema exploration, visualization and assisted query authoring [56–58], while Schema.org semantics are not equally well supported, we deliberately prioritize a close representation of the underlying data's structure over semantic correctness.
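To illustrate the issue, consider the following hypothetical declarations (prefixes as in the running example):

```sparql
# Hypothetical declarations for illustration only
INSERT DATA {
  ex:treatedAt rdfs:domain ex:Patient .
  ex:treatedAt rdfs:domain schema:Person .
}
```

Under RDFS entailment (rdfs2), every subject of ex:treatedAt would then be inferred to be both an ex:Patient and a schema:Person, an intersection of the two classes rather than a choice between them.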
Locally inferred schema In order to re-include previously inferred information such as additional types and classes due to sub-property, subclass, domain, range or equivalence relationships, we can extract the relevant schema directly from the data and the full definitions of the employed ontologies using the SPARQL 1.1 Property Paths [59] feature, independent of entailment support or statement materialization on the endpoint.
A corresponding SPARQL query is depicted in Listing 3.
The query constructs a graph which, in addition to all instantiated RDFS classes and RDF properties (and their direct properties), includes their generalizations and equivalent resources via RDFS and OWL semantics.
For both properties and classes, we resolve corresponding generalizations directly using the relevant RDFS entailment patterns (rdfs5, rdfs7, rdfs9, rdfs11) [53] and concept equivalences using OWL’s owl:equivalentClass, owl:equivalentProperty and owl:sameAs predicates [54] in lines 5 and 9. While owl:sameAs is only supposed to be used for the declaration of equivalence between individuals, it is commonly misused in practice and as such deliberately included in this query.
rdfs:Class annotations are further inferred following RDFS entailment rules rdfs2 and rdfs3 [53] from rdfs:domain and rdfs:range properties declared on instantiated rdf:Property resources (line 7).
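In condensed form, the core of such a query might read as follows (a sketch of the approach in Listing 3, with the resolution steps expressed as property paths; standard prefixes assumed as before):

```sparql
CONSTRUCT {
  ?sp a rdf:Property ; ?pp ?po .
  ?sc a rdfs:Class ; ?cp ?co .
}
WHERE {
  {
    # instantiated properties plus their generalizations and
    # equivalences (rdfs5, rdfs7, owl:equivalentProperty, owl:sameAs)
    ?s ?p ?o .
    ?p (rdfs:subPropertyOf|owl:equivalentProperty|^owl:equivalentProperty|owl:sameAs|^owl:sameAs)* ?sp .
    OPTIONAL { ?sp ?pp ?po }
  }
  UNION
  {
    # instantiated classes, including those implied by rdfs:domain/rdfs:range
    # declarations on used properties (rdfs2, rdfs3), plus their generalizations
    # and equivalences (rdfs9, rdfs11, owl:equivalentClass, owl:sameAs)
    { ?i a ?c }
    UNION { ?x ?q ?y . ?q rdfs:domain|rdfs:range ?c }
    ?c (rdfs:subClassOf|owl:equivalentClass|^owl:equivalentClass|owl:sameAs|^owl:sameAs)* ?sc .
    FILTER(!isBlank(?sc))
    OPTIONAL { ?sc ?cp ?co }
  }
}
```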
When applying this extraction approach to the dataset depicted in Fig. 1, we end up with the relevant schema depicted in Fig. 3. As before, classes are highlighted in blue and properties in green.
Following the extracted schema, it is now also possible to query for instances of the hospital and person classes, as well as a number of equivalent SNOMEDCT vocabulary terms.
Employing terminology services
In practice, SPARQL endpoints providing access to individual datasets cannot be (and are not) burdened with serving all vocabularies and terminologies that are used in the dataset or related to those. That is the purpose of specialized terminology services and vocabulary catalogs, such as the aforementioned LOV and BioPortal projects.
In order to resolve equivalences and generalizations across vocabularies, it is thus possible to make use of the SPARQL 1.1 Federated Query protocol [60, 61] to infer additional schema triples using external terminology services. The query depicted in Listing 4 employs federated queries to the SPARQL endpoint http://example.org/terminology in order to accomplish this. The query further explicitly filters out all subjects that are blank nodes in order to avoid renaming and resolution issues between blank nodes from different sources (cf. [60]).
While the approach follows the same principles as the previously introduced local inference (c.f. Listing 3), here each inference step also includes results from the external terminology service. As such, following the example from before, the extracted schema would now also include all inferred knowledge from the SNOMEDCT vocabulary as well as any vocabulary known to the terminology service that declares equivalences with SNOMEDCT.
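For illustration, the class-resolution part of such a federated query, reduced to a minimal standalone form, might look as follows (the endpoint URL is the placeholder from above; the full query in Listing 4 also covers properties):

```sparql
CONSTRUCT { ?sc a rdfs:Class }
WHERE {
  ?i a ?c .
  {
    # resolve generalizations and equivalences locally ...
    ?c (rdfs:subClassOf|owl:equivalentClass|^owl:equivalentClass|owl:sameAs|^owl:sameAs)* ?sc
  }
  UNION
  {
    # ... and against the external terminology service
    SERVICE <http://example.org/terminology> {
      ?c (rdfs:subClassOf|owl:equivalentClass|^owl:equivalentClass|owl:sameAs|^owl:sameAs)* ?sc
    }
  }
  FILTER(!isBlank(?c) && !isBlank(?sc))   # avoid blank node resolution issues across sources
}
```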
In some cases, such as with rare diseases, even this limited communication with remote terminology services might affect data privacy, since the instantiation of certain very rare classes or predicates might in itself reveal private data. In such cases, a local terminology service can be employed, e.g. by creating a local deployment of the LOV service or by providing local copies of the relevant vocabularies in full. Nevertheless, sharing the extracted schema in such cases may still require additional considerations.
Unfortunately, current implementations of federated SPARQL queries typically still incur large performance penalties due to suboptimal resolution strategies. As such, in practice, it is often helpful to manually decompose the single query into multiple query steps. An exemplary four-step approach using the SPARQL 1.1 UPDATE construct [62, 63] can be found in the supplementary materials, which also include performance-optimized reformulations of the other queries.
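As an illustration of the idea, a first step of such a decomposition might materialize the directly instantiated classes into a dedicated schema graph, against which later steps can then resolve generalizations and equivalences:

```sparql
# Step 1 of a possible decomposition (the graph name urn:x-local:schema is a
# hypothetical placeholder): materialize directly instantiated classes, so that
# later steps operate on this much smaller graph instead of the full dataset.
INSERT { GRAPH <urn:x-local:schema> { ?c a rdfs:Class } }
WHERE  { ?i a ?c . FILTER(!isBlank(?c)) }
```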
Schema-aided data selection and integration
Once extracted, the resulting schema can be exposed publicly or semi-publicly (e.g. with prior authentication) via a dedicated SPARQL endpoint. It is then possible to use existing SPARQL query writing assistance tools (i.e. query builders) such as OWLPath [64], QueryVOWL [58] or VSB [57] together with the extracted schema for the schema-introspection-aided design of data selection and integration queries without direct access to the private dataset. An overview of available tools can be found in [65].
Figure 4 depicts a screenshot of the visual query builder VSB [57], configured to employ introspection of a schema extracted using the “locally inferred” approach, as conducted in the following evaluation. Corresponding instructions for schema extraction and deployment can be found in the supplementary materials. This example illustrates how introspection of the public schema allows for the automated suggestion and autocompletion-assisted search for available properties and classes, as well as the relations between them, enabling easy query writing through interactive schema exploration. In the depicted case, the user is interested in instances of the schema:Person class and is provided with a list of property suggestions for the search string “fa”, as available in the original private data.
Such tools may optionally also employ the provided schema to construct SPARQL 1.1 queries that resolve term generalizations and equivalences following the semantics of the extracted schema. As such, the user does not have to rely upon proper entailment support on the dataset's SPARQL endpoint, but can construct explicit queries that spell out the relevant generalizations and equivalences, thus enabling ad-hoc data integration queries.
Consider, for example, the schema depicted in Fig. 3, where it is likely that the private data endpoint does not support entailment. The query must then be constructed in a way that accounts for the semantic implications of the schema. For example, in order to find all persons, one would have to query not only for all instances of the person class, but also for all instances of its equivalent classes, its subclasses and their equivalent classes, as well as for everything that occurs as the subject or object of a property with a corresponding domain or range, in this case the subject of a triple with the ex:treatedAt predicate. Query builders and query writing assistance tools can, however, automatically construct queries accounting for this without burdening the user. Such queries thus allow for the ad-hoc integration of data encoded with different ontologies and standards, based only on the previously extracted schema.
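A minimal sketch of such a query follows; the enumerated class IRIs are hypothetical stand-ins from the running example, whereas a query builder would enumerate all subclasses and equivalences (e.g. the SNOMEDCT terms) directly from the extracted schema:

```sparql
SELECT DISTINCT ?person
WHERE {
  {
    # the person class and its subclasses/equivalences, as read off the schema
    ?person a ?class .
    VALUES ?class { ex:Person ex:Patient }   # hypothetical IRIs from the running example
  }
  UNION
  {
    # membership implied by the rdfs:domain of ex:treatedAt
    ?person ex:treatedAt []
  }
}
```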
System architecture The workflow of the proposed architecture is illustrated in Fig. 5, which depicts the communication between client and data provider over a public network. In this scenario, the data provider’s internal communication within its private network is highlighted by the bounding box.
In preparation for client usage, the schema of the sensitive data stored in the private triple store is extracted in step 0 using the approach presented above and deployed to a publicly accessible schema endpoint.
Since the private data store remains inaccessible from outside its private network at all times, the schema extraction has to be conducted by the data provider herself. This can be done by manually extracting the schema on demand, e.g. using the four-step “LOV inferred” schema extraction approach employing the SPARQL UPDATE construct, by automatically running a corresponding extraction script at regular intervals, or by creating a “schema view” of the data store, which can then be queried directly by data consumers.
Once the schema endpoint is available, the client can start to create a SPARQL query in step 1, using a query builder of their choice in conjunction with the schema endpoint for introspection. The query is then sent to a submission endpoint acting as the gateway between the data provider and the client in step 2. For the scope of this work, we assume that this request includes algorithmic means of data anonymization, ensuring that its results are no longer privacy-sensitive, and that its validation (step 3) is performed manually.
Once validated, the request is scheduled in step 4 for processing within a secure enclave (processing), where the query and algorithm are evaluated (step 5). This is analogous to the approach proposed by Jochems et al. [47] and Deist et al. [48] as detailed in the related work section. Finally, only the processing result is returned to the client in step 6 without ever directly granting access to the data.
Evaluation
In order to evaluate the proposed approach, we extract schema information from a synthetic dataset of patient records (PRs), specifically generated to illustrate the intended use case, as well as from the three corpora GenDR, Orphanet and NCBI HomoloGene, as distributed through the third release of the interlinked life science data repository Bio2RDF [66].
The PRs dataset contains personal information of 10,000 individuals, such as name, birthday and phone number, and is published in conjunction with this paper. The dataset was generated using the open source generatedata tool and converted to a corresponding dataset of 150,000 RDF triples using the SPARQL Generate extension [67, 68]. Half of the records are encoded using the FOAF vocabulary [69] and half using the Schema.org vocabulary [70].
GenDR [71] is a database of genes associated with dietary restriction (DR), intended to facilitate research on the genetic and molecular mechanisms of DR-induced life-extension.
Orphanet [72] is a database of information on rare diseases and orphan drugs for all publics, intended for the improvement of the diagnosis, care and treatment of patients with rare diseases.
HomoloGene [73] is a database of homolog sequence relationships between 20 completely sequenced and annotated eukaryotic genomes.
GenDR, Orphanet and HomoloGene each provide custom vocabulary definitions describing their data encoding and semantics.
All four datasets were separately deployed to a private triple store, and the relevant schema was extracted using the three presented extraction methods: extracting only directly instantiated properties and classes using query 2, using local inference together with the respectively employed vocabularies (such as the FOAF and Schema.org definitions for the PRs dataset) via query 3, and finally using the LOV terminology server via query 4. The employed data and scripts may be found in the supplementary materials.