The methodological guidelines
Based on our experience with applying the Linked Data principles in the healthcare and drug domain, and on the existing Linked Data methodologies, we developed a set of methodological guidelines for consolidating drug data using the Linked Data approach. These guidelines improve upon the existing Linked Data methodologies and contain steps, activities and tools which are specific to the drug data domain. Their purpose is to guide data publishers through the process of generating high-quality, 5-star Linked Data in order to interlink, align and consolidate drug data from different national drug registries or other sources of drug data. The alignment and relationships between the existing methodologies and our guidelines are outlined in Table 5.
Along with the guidelines, we have developed a set of tools which simplify the execution of the specific steps of the methodology. They are intended to support the Linked Drug Data generation process both for people from the drug domain who do not have in-depth knowledge of Linked Data, and for Linked Data publishers who do not have in-depth knowledge of the drug domain. These tools are open and publicly available on GitHub [58].
Our methodological guidelines consist of the following steps (Fig. 1):
I. Domain and Data Knowledge
II. Data Modeling and Alignment
III. Transformation into 5-star Linked Data
IV. Publishing the Linked Data dataset on the Web
V. Use-cases, Applications and Services
These steps have been developed with reuse as a primary goal: they encourage data publishers in the drug domain to develop, modify and use reusable components throughout the methodology. This makes the Linked Drug Data lifecycle modular, i.e. constructed of loosely-coupled components which can be reused in the domain. These loosely-coupled components can be used separately when necessary, but they also form a seamless workflow for generating a high-quality, 5-star Linked Drug Data dataset. The reuse of such components, as in other areas of software development, reduces development time and increases productivity [56, 57].
Step I: Domain and data knowledge
The first step corresponds to the first steps from the existing methodologies: it is important for the data publisher to know the domain and the data in it very well. This understanding of the data source schema and semantics is crucial for the following steps which will involve data modeling, schema alignment and data transformation.
In the drug data domain, if this is the first time the data publisher comes across Linked Data, our advice is to first get familiar with the 5-star data system from Tim Berners-Lee [35], the four principles of Linked Data [70], and the LOD Cloud [32]. After that, a study of the LODD Project [36], the Bio2RDF Project [43] and the DrugBank Linked Data dataset [44] is recommended. This will help the data publisher to get a better insight into the Linked Drug Data and Linked Healthcare Data domains, the types of data which exist in them, their schema, their similarities and differences and their existing and potential links. It will also help him/her determine the ontologies and vocabularies already used in the domain, which can be important for the next step.
In a general case, when working with any other domain, it is important for the data publisher to become familiar with the domain in question and with the meaning of the dataset selected for transformation. For this, a consultation with a domain expert is usually necessary and therefore advised. Another approach is to explore the existing Linked Data datasets which are similar to or from the same domain as the one of interest. For this, the Datahub portal [71] and the LOD Cloud cache instance [72] can be used.
Step II: Data modeling and alignment
In the next step, the data publisher should focus on data modeling and alignment with other existing or future datasets. The data publisher has to choose the correct schema for the dataset, in order to annotate it correctly, i.e. use the data fields which are necessary for the final use-cases, annotate the fields unambiguously and with the correct semantics and make the correct schema choices which will allow seamless alignment with other datasets. Additionally, the data publisher has to define the URI naming scheme for the data entities, and optionally for the ontology or vocabulary classes and properties.
Data schema. In the drug data domain, studying the datasets from the LODD and Bio2RDF projects can help get an insight into the ontologies and vocabularies used in the domain. Some of the ontologies and vocabularies which a data publisher needs to have in mind are: Schema.org [59], DBpedia Ontology [73], UMBEL [74], DICOM [75], the DrugBank Ontology used for the data at [44], as well as other biomedical ontologies [76].
In order to support the data publishers from the drug domain in this step, we designed a reusable RDF schema for the data, shown in Fig. 2. The schema can be used by data publishers working with drug data from national drug registries or other sources. The schema is comprised of classes and properties from the Schema.org vocabulary [59]: the schema:Drug class, along with a large set of properties which instances of the class can have, such as name, code, activeIngredient, nonProprietaryName, availableStrength, cost, manufacturer, relatedDrug, description, url, etc. Additionally, in order to align the drug data with generic drugs from DrugBank, we use the properties brandName, genericName, atcCode and dosageForm from the DrugBank Ontology. In order to annotate the links which the drug product entities will have to generic drug entities from the LOD Cloud dataset, the rdfs:seeAlso relation is used.
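As a rough illustration, a drug product annotated with this schema could look along the following lines (a minimal Turtle sketch; the drug URI, the literal values and the DrugBank namespace shown are illustrative assumptions, not actual registry data):

@prefix schema:   <http://schema.org/> .
@prefix drugbank: <http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugbank/> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .

# Illustrative drug product entity; the URI, values and the DrugBank namespace are assumptions.
<http://example.com/drug/12345>
    a schema:Drug ;
    schema:name "Panadol 500 mg film-coated tablets" ;
    schema:nonProprietaryName "Paracetamol" ;
    schema:activeIngredient "Paracetamol" ;
    drugbank:genericName "Paracetamol" ;
    drugbank:atcCode "N02BE01" ;
    schema:dosageForm "film-coated tablet" ;
    rdfs:seeAlso <http://dbpedia.org/resource/Paracetamol> .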
In general, when working with data from another domain, the data schema is defined by the choice of vocabularies or ontologies to be used. The principles of ontology engineering and usage have been developed for exactly this purpose: to maximise the chances of reuse, and therefore allow better alignment between datasets [77]. This means the data publisher should always try to reuse an existing vocabulary or ontology, giving preference to those which are most widely used. A few tools for ontology and vocabulary discovery exist, and the data publisher should use them at this stage. The two most notable are Linked Open Vocabularies (LOV) [78] and DERI Vocabularies [79], which also provide usage statistics that can be used to assess the impact of a given vocabulary or ontology in a specific domain. Our choice of the Schema.org vocabulary follows the reusability paradigm: it is the most widely and generally used vocabulary across the Web.
However, datasets tend to have specific fields which are not covered by existing ontologies. In these cases, an existing ontology or vocabulary should be extended, or a new one should be defined. Whenever a new ontology is developed, it is important to define the mappings between the new classes and properties and the classes and properties from other ontologies, in order to enable ontology matching and RDF-based reasoning for schema alignment. To avoid defining specific new properties, we reused some properties from the Schema.org vocabulary which are currently not explicitly intended for use with schema:Drug entities. An example is the schema:addressCountry property, which is meant to be used with an address, but which we use in our schema to denote the country in which the drug is registered.
Another important approach in this step is the use of upper-level ontologies and vocabularies; they can provide a schema for many different and specific domains, due to their generality. Having two or more datasets annotated with the same upper-level ontology or vocabulary allows interlinking and inference between them, i.e. it improves the alignment which is crucial for data consolidation.
Data alignment. For the data alignment task, the data publisher needs to make sure that the generated dataset will be well aligned with existing and future datasets from the same domain. In order to aid the data publishers with this, as well as help them in preparing the drug data for the transformation step, we developed a CSV template intended for the drug domain [58]. This CSV template can be used with the drug data and is comprised of the fields necessary for applying the RDF schema from Fig. 2.
The data publisher interested in publishing Linked Drug Data should use this CSV template for the data, following the specifics defined for each of the fields. These specifics assure that the data will be of high quality and completely aligned with other drug data generated using the same methodological guidelines.
URI formats. From the URI naming scheme perspective, in the domain of drug data it is important to determine the types of entities which exist in the dataset. This will help in defining the entity URIs for the Linked Data dataset. According to the Linked Data principles, each entity in the dataset - along with the classes and properties in the ontology - needs to have a unique identifier in the form of an HTTP URI. In order to provide better performance when using the dataset in the future, our experience suggests using separate URL paths for different entity types, e.g. http://example.com/drug/, http://example.com/interaction/, http://example.com/disease/, etc. An additional recommendation is to use slash-based URIs instead of hash-based ones. This may result in an additional HTTP request by the machine accessing the URI, but it provides better performance when accessing large datasets [80].
However, to simplify this step for the drug data publishers, we advise the use of the existing webpage URLs of the drugs from the national registry websites, which are already unique. According to the Linked Data principles, the entity URI should denote a Web location where the end-users and agents can get more information about the entity, so our approach satisfies the condition.
Step III: Transformation into 5-star Linked Data
During the third step, the source dataset should be transformed into a 5-star Linked Data dataset. The process of transformation can be executed in many different ways, and with various software tools, e.g. OpenRefine [81], LODRefine [82], D2R Server [83], Virtuoso [84], Silk Framework [85], etc.
To help the data publishers from the drug domain and to automate this step, we developed a reusable OpenRefine transformation script [58]. This transformation script is specifically designed for the drug data domain, and for the RDF schema and CSV template from Step II. It contains a set of actions which generate RDF from the input CSV file containing the drug data. In the process, it also locates the associated generic drugs from the DrugBank and DBpedia datasets for each drug product in the source dataset, and extends the generated RDF with links between the drugs from the dataset and the corresponding drugs from the LOD Cloud.
The transformation script can be reused with any OpenRefine instance which has the RDF extension. It can be applied on any drug data dataset formatted with the CSV template from Step II. As a result, it will generate a Linked Drug Data dataset annotated with the RDF schema from Step II (Fig. 2).
The RDF schema from Step II defines relations between the drug products from the dataset as well. These relations are denoted with the schema:relatedDrug relation (Fig. 2). In order to provide means for generating RDF triples which interconnect the drugs from the dataset, we developed a SPARQL query [58] which can be executed over the dataset generated with the OpenRefine transformation script. The SPARQL query detects all pairs of drugs from the dataset which have the same ATC code - and therefore have the same therapeutic, pharmacological and chemical properties - and creates two triples connecting the first drug to the second one with the schema:relatedDrug property, and vice-versa.
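A minimal sketch of such a query is shown below (the RDF graph URI and the exact path from the drug entity to its ATC code value are assumptions here; the actual reusable query is available at [58]):

PREFIX schema: <http://schema.org/>

# Sketch: link every pair of drugs which share an ATC code, in both directions.
# The graph URI and the schema:code / schema:codeValue path are assumptions.
INSERT {
  GRAPH <http://example.com/graph/drugs> {
    ?drug1 schema:relatedDrug ?drug2 .
    ?drug2 schema:relatedDrug ?drug1 .
  }
}
WHERE {
  GRAPH <http://example.com/graph/drugs> {
    ?drug1 a schema:Drug ; schema:code [ schema:codeValue ?atc ] .
    ?drug2 a schema:Drug ; schema:code [ schema:codeValue ?atc ] .
    FILTER (?drug1 != ?drug2)
  }
}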
In a general case, in order to make the correct choices about the tools to be used for the transformation process, it is important to distinguish the characteristics of the dataset first. The nature of the dataset will determine if (a) the transformation is a one-time task, a task which will have to be executed on a given time interval (e.g. once a month), or a continually running task; (b) old versions of the transformed dataset are necessary for versioning and as backup, if during future transformations only the changes in the data are needed for transformation, i.e. ‘delta’ updates are performed, or if older data are no longer necessary for the particular use-case; (c) manual or automated data cleansing is needed before the first transformation and/or subsequent transformations; (d) the source dataset is always available at the same location and is accessible via the same interfaces. These specifics of the dataset in question can then help the data publisher determine if the transformation task can be fully or partially automated, and identify the parts of the transformation workflow which require human attention and input.
Adding metadata about the newly created Linked Data dataset is important from the data reuse perspective: vocabularies such as VoID [86] allow software agents to determine the characteristics of the dataset and the links it has to other Linked Data datasets. VoID metadata contains information about the name, description and category of the dataset, versioning information and update frequency, contact information, the license under which the dataset is made available, the links to the SPARQL endpoints and URI lookup endpoints, and the vocabularies used along with their properties and classes. It also explicitly describes the links, defined in the dataset itself, between the dataset and other Linked Data datasets. The use of the VoID vocabulary is explicitly stated in the corresponding steps of the methodologies of Hyland et al., Hausenblas et al. and Villazón-Terrazas et al.
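A hedged sketch of such VoID metadata for a national Linked Drug Data dataset could look as follows (all URIs, titles and the license shown are placeholders):

@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

# Placeholder VoID description of the dataset and of its linkset to DBpedia.
<http://example.com/void#drugs> a void:Dataset ;
    dcterms:title "Example National Drug Registry as Linked Data" ;
    dcterms:description "Drug products from the national drug registry, published as 5-star Linked Data." ;
    dcterms:license <http://creativecommons.org/licenses/by/4.0/> ;
    void:sparqlEndpoint <http://example.com/sparql> ;
    void:vocabulary <http://schema.org/> .

<http://example.com/void#drugs-dbpedia> a void:Linkset ;
    void:subjectsTarget <http://example.com/void#drugs> ;
    void:objectsTarget <http://example.com/void#dbpedia> ;
    void:linkPredicate rdfs:seeAlso .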
In the domain of drug data, providing new and updated data is very important: old data is of little importance to the end-user, except for analytics. This means that the data publisher should anticipate the change rate of the source dataset and correctly design the workflow for refreshing data from the source dataset into the Linked Data dataset. This translates into creating a sustainability plan which will transform new data and add it to the Linked Data dataset, remove old data and provide versioning. Depending on the size of the source dataset, the data publisher can choose to re-transform the source dataset on each update, or to provide means for performing ‘delta’ updates. Providing versioning is also important, as new transformations can sometimes result in errors, rendering the dataset unusable.
Step IV: Publishing the Linked Data dataset on the Web
In the fourth step, the generated 5-star Linked Data dataset, along with its VoID metadata, should be published on the Web. This should be done following the W3C recommendations for publishing Linked Data on the Web [50], which suggest enabling direct URI resolution, providing a RESTful API, providing a SPARQL endpoint, and/or providing the dataset as a file for download.
There is a large palette of tools and software platforms which allow seamless Linked Data publishing. Among them are D2R Server [83] and Virtuoso [84], which allow Linked Data publishing of datasets which are originally in an RDF file (Turtle, N3, RDF/XML, JSON-LD, etc.), a CSV file, or in a relational database. These platforms then allow access to the Linked Data dataset via HTML pages, via RDF file downloads and via a SPARQL endpoint which can be used as a RESTful API as well.
When creating a Linked Drug Data dataset, we recommend adding and interlinking it with the global LinkedDrugs dataset which will consist of all such datasets generated by different parties, using these guidelines. To enable this, we have developed a web-based tool for uploading the generated datasets [60], which after a human-based quality assessment triggers an automated process for interlinking the new dataset with the existing LinkedDrugs datasets and publishing it according to the Linked Data principles and best practices.
We also recommend publishing the dataset at Datahub.io [71] under the healthcare and drugs categories, as well as adding the #linkeddrugs tag. Additionally, we advise joining the LOD Cloud [87]. Both these actions will enable higher visibility of the dataset.
Another important part of this step is the announcement of the newly created Linked Data dataset to the public. For this, information about the dataset along with its VoID metadata should be published on popular data portals, such as Datahub.io [71]. This announcement should also be done via existing communication channels of the data publisher and his/her organization. In order to facilitate further use and reuse of the dataset, it is important to provide a form or a contact email address for interested parties to be able to report data or access issues, and provide feedback. On the organization side, it is important that these reports and requests are attended to in a timely fashion; otherwise the usability of the dataset is significantly lowered.
Step V: Use-cases, applications and services
The last step refers to defining use-case scenarios and/or developing specific applications and services which will use the data from the newly created Linked Data dataset, to showcase the (re)usability of the dataset and its links to other Linked Data datasets. This will present the potential of the contextually linked datasets to future interested parties.
When creating a Linked Drug Data dataset, potential use-case scenarios, applications and services should include contextually linked data from the LODD datasets and the Bio2RDF datasets. The LODD datasets, especially the DrugBank linked dataset, contain data about generic drugs, i.e. active ingredients, along with their pharmaceutical and pharmacological properties, targets, brand names, food interactions, drug interactions, etc. Since the national drug data registries contain information about drug products, one-to-one mappings between the entities from such datasets and the DrugBank and DBpedia datasets are not possible. Instead, using our RDF schema from Step II and the OpenRefine transformation script from Step III, each entity from the dataset is linked to one or more generic drugs/active ingredients from DrugBank and DBpedia, based on its ATC code [88]. This way, the drugs from the dataset get a contextual link to the generic drugs and, from there, to all of their properties and characteristics. Additionally, the existing links from the DrugBank and DBpedia generic drugs to other healthcare datasets can be further exploited, as they also represent contextual links. Such links currently point to LinkedCT and Bio2RDF. To demonstrate the usability of the generated Linked Drug Data datasets, we provide example use-cases on the project website [60] and on GitHub [58].
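As an illustration of one such use-case, a federated SPARQL query can follow the rdfs:seeAlso links from a national drug product to DBpedia and retrieve additional context about the generic drug (a hedged sketch; the drug name filter is arbitrary and the local data is assumed to follow the schema from Fig. 2):

PREFIX schema: <http://schema.org/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:    <http://dbpedia.org/ontology/>

# Sketch: for drug products whose name contains "paracetamol", follow the
# rdfs:seeAlso links to DBpedia and fetch the English abstract of the generic drug.
SELECT ?drug ?name ?generic ?abstract
WHERE {
  ?drug a schema:Drug ;
        schema:name ?name ;
        rdfs:seeAlso ?generic .
  FILTER regex(?name, "paracetamol", "i")
  SERVICE <http://dbpedia.org/sparql> {
    ?generic dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
}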
In a general case, the use-cases can be text-based scenarios, specific SPARQL queries, or prototype applications, describing the ways in which the data from the new dataset can be browsed, retrieved and used. Here, a specific focus should be placed on how the links to other Linked Data datasets can be exploited to reach other data, not present in the original data source, in order to extend its context. With this, the data publisher shows interested parties that the original dataset has more value when combined with datasets from the same or a similar context than when used in an isolated scenario. Besides such use-cases, the same effects can be showcased through the development of applications and services. They bring more visibility to the general (re)usability of the Linked Data dataset, but generally require more time and effort.
The created use-cases, applications and/or services, should be shared and announced to the public, along with the dataset itself and its VoID metadata. The use of the same channels from the previous step is advised.
Methodology supporting tools
As part of the methodological guidelines, with the intent to provide help to the data publishers working in the drug data domain, we designed and developed a set of tools. These tools consist of the RDF schema, the CSV template, the OpenRefine transformation script, the SPARQL-based tool for interlinking related drugs and the web-based tool for automated transformation, interlinking and publishing of the generated Linked Drug Data dataset.
RDF schema
In order to model the domain of drug products on a global scale, we needed to create one common schema for all national drug data repositories, and then use it for annotating the drug data. With it, the goal was to provide alignment of drug data from different sources, with different formats and different levels of data granularity, in order to enable simpler data exploitation.
First, we analyzed the national drug data repositories of 31 countries1 and the analysis helped us define a common set of properties which exist and which we want to use in our dataset. This set consisted of 24 properties, including the brand name of the drug, the generic name, the ATC code, the EAN code (barcode), the list of active ingredients, the drug strength, dosage form, cost, manufacturer, the country it was registered in, the details about its license, etc. Not all national drug data repositories provide all of the data and properties we selected for our schema, but we did not want to decide against using those properties - they are useful where available.
Following the best practices in ontology and vocabulary use [77], we started by considering reuse of classes and properties from existing vocabularies. We used the common set of properties we defined in the previous step, and found that the Schema.org vocabulary [59] was fully applicable to our set. The Schema.org vocabulary, as part of its Health and Lifesciences Extension [89], contains a definition of the schema:Drug class and a large set of properties applicable to it [90]. As we can see in Fig. 2, the RDF schema also uses the DrugBank ontology and the RDFS ontology, for interoperability purposes.
Schema.org is a joint initiative of Google, Bing, Yahoo and Yandex, intended as a common vocabulary for structured markup on web pages [91–93]. It is used by these search engines to introduce rich snippets about people, events, products, movies, restaurants, books, TV shows, etc. It is also used in Google’s Knowledge Graph, in emails confirming reservations and receipts (from restaurants, hotels, airlines, etc.) in both Gmail and Microsoft’s Cortana, for rich pins on Pinterest, as well as by Apple’s Siri [94]. Its use on the Web has been increasing in the past few years, more rapidly than that of the more rigorous general-purpose vocabularies and ontologies before it [95]. Its success is mainly attributed to its simplicity: it uses a generally flat hierarchy of classes, so that the implementation barrier for data publishers and webmasters is kept low.
The growing use of the Schema.org vocabulary, as well as its domain generality, has put the vocabulary in a position in which it is being used for aligning existing ontologies and datasets. This is happening in the healthcare domain as well [96]. With the release of Schema.org version 3.0 [97], the medical and healthcare related terms [98] have been moved to the Health and Lifesciences Extension [89], to enable and ensure future collaborative development of the terms by the Healthcare Schema Vocabulary community group at W3C [99, 100]. This plan for long-term support by the domain community gave us sufficient certainty to choose the Schema.org vocabulary, instead of the domain-specific ontologies [76], to provide a common schema for drug products on a global scale.
In order to provide some alignment between the generated datasets and the LODD and DrugBank datasets, we use several properties from the DrugBank ontology to describe the drug products. More specifically, we use drugbank:brandName, drugbank:genericName, drugbank:atcCode and drugbank:dosageForm as additional properties for the same values denoted by schema:name, schema:nonProprietaryName, schema:code and schema:dosageForm, respectively. We do this to simplify federated SPARQL queries which work with data from our LinkedDrugs dataset and the DrugBank dataset. Additionally, each drug product from the LinkedDrugs dataset is an instance of a specific class from the ATC Classification Ontology [101], in order to classify the drug according to the ATC classification system, based on its ATC code(s). We also chose rdfs:seeAlso as it is the most widely used relation for interlinking similar entities in the LOD Cloud [33].
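As a hedged illustration of how these duplicated properties simplify federated querying, a query can join local drug products with DrugBank generic drugs directly over the shared ATC code value (the DrugBank endpoint URL and namespace below follow the LODD DrugBank dataset and are assumptions here):

PREFIX schema:   <http://schema.org/>
PREFIX drugbank: <http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugbank/>

# Sketch: join local drug products with DrugBank generic drugs over the ATC code.
# The DrugBank endpoint and namespace are assumed from the LODD DrugBank dataset.
SELECT ?drug ?atc ?dbDrug ?dbGenericName
WHERE {
  ?drug a schema:Drug ;
        drugbank:atcCode ?atc .
  SERVICE <http://wifo5-04.informatik.uni-mannheim.de/drugbank/sparql> {
    ?dbDrug drugbank:atcCode ?atc ;
            drugbank:genericName ?dbGenericName .
  }
}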
Just like any other RDF schema, vocabulary or ontology, the RDF schema selected for our Linked Drug Data datasets can evolve over time; it can be extended and modified in the future by us or by third parties, as the field of drug data evolves.
CSV template
In order to enable data publishers to annotate their drug data with the RDF schema from Fig. 2, we need a formal template for the data being prepared for transformation, and a formal transformation process. For the former, we define a CSV template, available publicly and as open-source [58]. The CSV template contains 39 columns which represent the different data fields needed from the source data for the transformation process. They include the URI of the drug, its brand name, generic name(s), manufacturer(s), ATC code(s), active ingredient(s), strength, cost, etc. They are modelled to fit the RDF schema, which encompasses all data necessary for high-quality modeling of the domain.
The data type of the different columns is usually a simple text value, except where we note otherwise. Some important notes regarding the field data types include: the strength value is divided into an integer-value column denoting the strength and a text-value column denoting the strength unit; similarly, the cost of the drug is separated into a float value and a currency value; the date columns need to be formatted as “YYYY-MM-DD”; the prescription status should be enumerated as either “OTC” or “PrescriptionOnly”; the currency code needs to comply with the ISO standard for denoting currencies [102]; the country in which the drug is registered needs to be denoted using a country code according to an ISO standard [63]; if there are multiple generic names, manufacturers or active ingredients, they should be denoted one per column in the available genericNameN, manufacturerN and activeIngredientN columns, respectively; etc. The details about the other column data types are available on the project website [58].
The CSV template uses a vertical line character (|) as a delimiter, since the regular CSV separators such as a comma (,) and a semicolon (;) are very often present in the cell values when working with drug data, and can be misinterpreted. It is important to note that the order of the columns in the CSV template is not relevant, if used with our OpenRefine transformation script.
As with the RDF schema, the CSV template is open and publicly available, and can therefore be extended or modified in the future both by us and by third parties, as the drug data field evolves and more Linked Drug Data datasets are created.
OpenRefine transformation script
Step III of the methodology contains the task of transforming the source data into the RDF schema selected in Step II. Since we defined an RDF schema which can be applied in the drug data domain for drug products registered in different countries, we also provide a tool which helps automate the transformation process, while ensuring compliance of the generated data with the defined RDF schema, and therefore providing aligned, high-quality, 5-star Linked Data for the drug domain. The intent of this tool is to lower the barrier to transforming data into RDF and Linked Data for data publishers who are not deeply involved in or experienced with Semantic Web and Linked Data practices.
We provide this Linked Data generation tool in the form of an OpenRefine transformation script. OpenRefine [81] is an open-source software for working with structured data, usually CSV, TSV, XML, etc. It provides users with functionalities for working with large datasets: the users can record their actions over a small set of example rows, and then apply them to the entire source dataset. These actions can include data transformations, merging, data cleaning tasks, manipulation of the columns, etc. OpenRefine also has an RDF extension which allows reconciliation of cell values against RDF data from SPARQL endpoints. This allows linking of cell values with entities from a SPARQL endpoint, for unambiguous identification of entities. It also allows mapping of the source data into RDF, by defining an ‘RDF skeleton’. The output of this action is an RDF file generated from the source data, according to the definitions in the ‘RDF skeleton’.
OpenRefine’s ability to save the user actions and export them in JSON format allows a given set of actions to be reused for different datasets. This gives us the ability to define a data transformation which can be reused over different source drug datasets that share the same columns; this is exactly what our CSV template provides. The defined list of data transformation actions we created constitutes our OpenRefine transformation script [58].
Our OpenRefine transformation script is designed for data complying with the CSV template, and its output is a Linked Drug Data dataset which uses our RDF schema. The transformation script contains three actions:
A. reconcile the columns genericName1, genericName2, ..., genericName5 against DBpedia,
B. reconcile the column atcCode against the DrugBank dataset, and
C. create an RDF schema skeleton.
Action A. uses the RDF extension of OpenRefine, which takes the cell value from a selected column and finds potential entities from a given SPARQL endpoint which can be matched to the entity denoted by the row. In our case, we use the five genericName columns - which hold the generic names of the active ingredients of the drug entity - and we try to match each of them to a dbo:Drug entity from the DBpedia SPARQL endpoint, using its rdfs:label value. If the reconciliation service finds a matching candidate entity, we use it in step C. to create an RDF triple which links the drug entity from our CSV dataset with the matched candidate from DBpedia, via an rdfs:seeAlso relation, for instance:
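(A hedged illustration; the drug product URI follows the example.com pattern from Step II, and the DBpedia resource shown is one possible matched candidate.)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# An rdfs:seeAlso link from a drug product to a matched generic drug in DBpedia.
<http://example.com/drug/12345> rdfs:seeAlso <http://dbpedia.org/resource/Paracetamol> .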
Action B. does a similar reconciliation, but on the atcCode column from the CSV dataset and against the DrugBank endpoint. It tries to find matches between the value of the atcCode column on our side and the drugbank:atcCode value of drugbank:drugs instances from the endpoint. Unlike the situation in A., here we can have more than one matching candidate from DrugBank. The reason is that there can be multiple drugbank:drugs instances which have the same ATC code, i.e. which share the same therapeutic, pharmacological and chemical properties. As in A., we use all matching candidates from the reconciliation in step C. to create RDF triples which link the drug entity from our CSV dataset to the matched drug entities from DrugBank, such as:
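(A hedged illustration; the drug product URI is a placeholder, the DrugBank URI follows the URI scheme of the LODD DrugBank dataset, which is an assumption here, and one such triple is created per matched candidate.)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# An rdfs:seeAlso link from a drug product to a matched generic drug in DrugBank.
<http://example.com/drug/12345> rdfs:seeAlso <http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugs/DB00316> .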
Action C. creates the RDF schema skeleton, which contains the rules for mapping the consolidated CSV file into RDF. In the RDF schema skeleton (Fig. 2), we define mappings between the CSV columns and certain RDF triple patterns. Some of the mappings are straightforward, such as the mappings of the brand name, the generic name, the dosage form, the country, the url, the description, etc. For them, we define the URI of the drug as the subject, we denote a specific property for the triple, and we define the value of the column as a literal or as the object of the triple. For instance, the brand name of a drug is mapped into RDF triples of the following format:
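(Sketch with an illustrative drug URI and brand name value.)

@prefix schema: <http://schema.org/> .

# The drug product URI as subject, schema:name as property, and the brand name as a literal object.
<http://example.com/drug/12345> schema:name "Panadol 500 mg film-coated tablets" .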
However, other mappings are more complex. Mappings of values such as the ATC code, the cost, the strength, the manufacturer, the license details, etc., require new entities of different types to be created. For instance, in order to add the information about the ATC code to the drug entity, we need to create a new blank node of type schema:MedicalCode, which has two additional triples: one with the schema:codeValue property and one with the schema:codingSystem property. This ATC code mapping can be represented with:
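(A hedged Turtle sketch; the drug URI and code value are illustrative, and the connecting schema:code property and the “ATC” coding system label are assumptions here.)

@prefix schema: <http://schema.org/> .

# The ATC code is attached to the drug via a new blank node of type schema:MedicalCode.
<http://example.com/drug/12345> schema:code _:atc .
_:atc a schema:MedicalCode ;
      schema:codeValue "N02BE01" ;
      schema:codingSystem "ATC" .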
The license mappings were the most complex, as can be seen in Fig. 2. Aside from using OpenRefine’s user interface for defining the RDF skeleton, we used its GREL language for mapping the reconciliation results from actions A. and B. into rdfs:seeAlso triples.
As a result of the transformation script, a Linked Data dataset with links to the LOD Cloud is created. Like the other tools, the transformation script is available as an open-source JSON file, which can be extended and modified in the future. To support it, we also developed a BASH script which sends the CSV file with drug data, formatted according to our CSV template, along with the OpenRefine transformation script, to a running BatchRefine service [58]. The result of this call is the RDF output representing the transformed dataset.
SPARQL-based tool for extending and interlinking the dataset
Once the drug dataset has been transformed into a Linked Data dataset with the other tools, an additional step is required in Step III: creating the internal links between drugs which share the same functionality, i.e. the same therapeutic, pharmacological and chemical properties, in order to create a better basis for use-cases. In other words, we need to create links between drugs from the dataset which have the same function, i.e. are aimed at treating the same condition. To create these links, we use the drugs’ ATC codes. According to the World Health Organization coding scheme [88], if two drugs have the same ATC code, they share the same function. For this purpose, we define a reusable SPARQL query [58] which detects all pairs of drugs from the dataset which have the same ATC code and, using the schema:relatedDrug property, creates a pair of triples for them, for instance:
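(Illustrative drug product URIs, following the example.com pattern from Step II.)

@prefix schema: <http://schema.org/> .

# A two-way link between two drug products which share the same ATC code.
<http://example.com/drug/12345> schema:relatedDrug <http://example.com/drug/67890> .
<http://example.com/drug/67890> schema:relatedDrug <http://example.com/drug/12345> .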
These two triples create a two-way link between the drugs in the dataset, denoting their functional similarity. The newly created RDF triples are stored by the SPARQL query in the same RDF graph where the dataset is already stored. These interlinks can be used to provide users with alternative drugs they may require for treating their condition, either in the same or in a different country.
Since not all source registries contain the ATC code information, and in order to increase the number of interlinked drug products from the dataset and support better data analytics, we define an additional reusable SPARQL query [58] which assigns ATC codes to all drugs from the dataset which lack this information. The SPARQL query detects drug products without an ATC code, finds the generic drug from DBpedia which the drug is linked to via the rdfs:seeAlso relation, gets the ATC code of the DBpedia generic drug and assigns it to the drug product in question. Since the SPARQL query for interlinking drugs from the dataset depends on the ATC code, this SPARQL query for extending the dataset with the missing ATC code values should be executed first.
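A minimal sketch of this second query is given below (the RDF graph URI is a placeholder, DBpedia is assumed to expose the ATC code split into dbo:atcPrefix and dbo:atcSuffix values, and the actual reusable query is available at [58]):

PREFIX schema: <http://schema.org/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:    <http://dbpedia.org/ontology/>

# Sketch: for drug products without an ATC code, fetch the code of the linked
# DBpedia generic drug and attach it as a new schema:MedicalCode node.
INSERT {
  GRAPH <http://example.com/graph/drugs> {
    ?drug schema:code [ a schema:MedicalCode ;
                        schema:codeValue ?atc ;
                        schema:codingSystem "ATC" ] .
  }
}
WHERE {
  GRAPH <http://example.com/graph/drugs> {
    ?drug a schema:Drug ;
          rdfs:seeAlso ?generic .
    FILTER NOT EXISTS { ?drug schema:code ?existingCode . }
  }
  SERVICE <http://dbpedia.org/sparql> {
    ?generic dbo:atcPrefix ?prefix ;
             dbo:atcSuffix ?suffix .
  }
  BIND (CONCAT(STR(?prefix), STR(?suffix)) AS ?atc)
}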
Both SPARQL queries are parametrized and should be edited before execution. They can be executed over the Linked Data storage used for storing the Linked Data dataset generated with the other tools.
Web-based tool for automated transformation, interlinking and publishing
The generated Linked Drug Data dataset needs to be published on the Web according to the Linked Data principles and best practices, as advised in Step IV. In order to aid the data publishers, this step can be automatically executed by using a web-based tool we provide. The data publishers can upload the generated Linked Data dataset(s) on the LinkedDrugs project website [60], and after a human-based quality assessment, the dataset will be automatically published. For this we use a publicly available Virtuoso instance [61], from which the new dataset is available on the Web as Linked Data, via its SPARQL endpoint [62]. The RDF graph identifier is returned to the data publisher after the successful upload process.
Besides publishing finished Linked Drug Data datasets, the web-based tool and its automated process can also execute the previous steps of the methodology for the data publisher: (a) they can generate an interlinked Linked Data dataset from an input CSV file, and (b) they can interlink drugs with schema:relatedDrug relations from an input RDF file. For the former, the uploaded CSV file needs to be generated following our CSV template; based on it, the predefined RDF schema and the OpenRefine transformation script, our web-based tool and its server-side process will generate the Linked Data dataset. Using the SPARQL-based tool from above, it will then generate links between the drugs from the dataset, based on their ATC codes. For the latter, the web-based tool directly creates the schema:relatedDrug relations between similar drugs from the uploaded Linked Drug Data dataset in RDF. With this, we move most of the data processing in the methodological guidelines away from the data publishers and simplify their workflow.
When a data publisher uses our web-based tool at [60] to publish a Linked Drug Data dataset, our system also adds it to the global Linked Drug Data dataset - the LinkedDrugs dataset - by storing it in another RDF graph and generating schema:relatedDrug triples for linking the drugs from the new dataset with the drugs from the existing datasets in LinkedDrugs, and vice-versa. The LinkedDrugs dataset then contains data for drug products provided by different publishers, including our team, and is available via a permanent, dereferenceable URI, which supports HTTP content negotiation [65].