Structuring research methods and data with the research object model: genomics workflows as a case study

Hettne, Kristina M; Dharuri, Harish; Zhao, Jun; Wolstencroft, Katherine; Belhajjame, Khalid; Soiland-Reyes, Stian; Mina, Eleni; Thompson, Mark; Cruickshank, Don; Verdes-Montenegro, Lourdes; Garrido, Julian; de Roure, David; Corcho, Oscar; Klyne, Graham; van Schouwen, Reinout; ‘t Hoen, Peter A C; Bechhofer, Sean; Goble, Carole; Roos, Marco

doi:10.1186/2041-1480-5-41

Research
Open access
Published: 18 September 2014


Kristina M Hettne¹,
Harish Dharuri¹,
Jun Zhao³,
Katherine Wolstencroft²,⁶,
Khalid Belhajjame²,
Stian Soiland-Reyes²,
Eleni Mina¹,
Mark Thompson¹,
Don Cruickshank³,
Lourdes Verdes-Montenegro⁵,
Julian Garrido⁵,
David de Roure³,
Oscar Corcho⁴,
Graham Klyne³,
Reinout van Schouwen¹,
Peter A C ‘t Hoen¹,
Sean Bechhofer²,
Carole Goble² &
Marco Roos¹

Journal of Biomedical Semantics volume 5, Article number: 41 (2014)


Abstract

Background

One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows.

Results

We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as “which particular data was input to a particular workflow to test a particular hypothesis?”, and “which particular conclusions were drawn from a particular workflow?”.

Conclusions

Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well.

Availability

The Research Object is available at http://www.myexperiment.org/packs/428

The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro

Background

One of the main challenges for biomedical research lies in the integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms, for instance to explain the onset and progression of human diseases. Computer-assisted methodology is needed to perform these studies, posing new challenges for upholding scientific quality standards for the reproducibility of science. The aim of this paper is to describe how the research data, methods and metadata related to a workflow-centric computational experiment can be aggregated and annotated using standard Semantic Web technologies, with the purpose of helping scientists who perform such experiments to meet requirements for understanding, sharing, reuse and repurposing.

The workflow paradigm is gaining ground in bioinformatics as the technology of choice for recording the steps of computational experiments [1–4]. It allows scientists to delineate the steps of a complex analysis and expose this to peers using workflow design and execution tools such as Taverna [5], and Galaxy [6], and workflow sharing platforms such as myExperiment [7] and crowdLabs [8]. In a typical workflow, data outputs are generated from data inputs via a set of (potentially distributed) computational tasks that are coordinated following a workflow definition. However, workflows do not provide a complete solution for aggregating all data and all meta-data that are necessary for understanding the full context of an experiment. Consequently, scientists often find it difficult (or impossible) to reuse or repurpose existing workflows for their own analyses [9]. In fact, insufficient meta-data has been listed as one of the main causes of workflow decay in a recent study of Taverna workflows on myExperiment [9]. Workflow decay is the term used when the ability to re-execute a workflow after its inception has been compromised.

We will be able to better understand scientific workflows if we are able to capture more relevant data and meta-data about them, including the purpose and context of the experiment, sample input and output datasets, and the provenance of workflow executions. Moreover, if we wish to publish and exchange these resources as a unit, we need a mechanism for aggregation and annotation that would work in a broad scientific community. Semantic Web technology seems a logical choice of technology, given its focus on capturing the meaning of data in a machine readable format that is extendable and supports interoperability. It allows defining a Web-accessible reference model for the annotation of the aggregation and the aggregated resources that is independent of how data are stored in repositories. Examples of other efforts where Semantic Web technology has been used for biomedical data integration include the Semantic Enrichment of the Scientific Literature (SESL) [10] and Open PHACTS [11] projects. We applied the recently developed Research Object (RO) family of tools and ontologies [12, 13] to preserve the scientific assets and their annotation related to a computational experiment. The concept of the RO was first proposed as an abstraction for sharing research investigation results [14]. Later, the potential role for ROs in facilitating not only the sharing but also the reuse of results, in order to increase the reproducibility of these results, was envisioned [15]. Narrowing down to workflow-centric ROs, preservation aspects were explored in [16], and their properties as first class citizen structures that aggregate resources in a principled manner in [13]. We also showed the principle of describing a (text mining) workflow experiment and its results by Web Ontology Language (OWL) ontologies [17]. The OWL ontologies were custom-built, which we argue is now an unnecessary bottleneck for exchange and interoperability. These studies all contributed to the understanding and implementation of the concept of an RO, but the data used were preliminary, and the studies were focused on describing workflows with related datasets and provenance information, rather than on describing a scientific experiment of which workflows are a component.

A workflow-centric RO is defined as a resource that aggregates other resources, such as workflow(s), provenance, other objects and annotations. Consequently, an RO represents the method of analysis and all its associated materials and meta-data [13, 15], distinguishing it from other work mainly focusing on provenance of research data [18, 19]. Existing Semantic Web frameworks are used, such as (i) the Object Reuse and Exchange (ORE) model [20]; (ii) the Annotation Ontology (AO) [21]; and (iii) the W3C-recommended provenance exchange models [22]. ORE defines the standards for the description and exchange of aggregations of Web resources and provides the basis for the RO ontologies. AO is a general model for annotating resources and is used to describe the RO and its constituent resources as well as the relationships between them. The W3C provenance exchange models enable the interchange of provenance information on the Web, and the Provenance Ontology (PROV-O) forms the basis for recording the provenance of scientific workflow executions and their results.

In addition, we used the minimal information model “Minim”, also in Semantic Web format, to specify which elements in an RO we consider “must haves”, “should haves” and “could haves” according to user-defined requirements [23]. A checklist service subsequently queries the Minim annotations as an aid to make sufficiently complete ROs [24]. The idea of using a checklist to perform quality assessment is inspired by related checklist-based approaches in bioinformatics, such as the Minimum Information for Biological and Biomedical Investigations (MIBBI)-style models [25].

Case study: genome wide association studies

As a real-world example we aggregate and describe the research data, methods and metadata of a computational experiment in the context of studies of genetic variation in human metabolism. Given the potential of genetic variation data in extending our understanding of genetic diseases, drug development and treatment, it is crucial that the steps leading to new biological insights can be properly recorded and understood. Moreover, bioinformatics approaches typically involve aggregation of disparate online resources into complex data parsing pipelines. This makes the experiment a fitting test case for an instantiated RO. The biological goal of the experiment is to aid in the interpretation of the results of a Genome-Wide Association Study (GWAS) by relating metabolic traits to the Single Nucleotide Polymorphisms (SNPs) that were identified by the GWAS. GWA studies have successfully identified genomic regions that predispose individuals to diseases (see for example [26], for a review see [27]). However, the underlying biological mechanisms often remain elusive, which led the research community to take an interest in genetic association studies of metabolite levels in blood (see for example [28–30]). The motivation is that the biochemical characteristics of the metabolite and the functional nature of affected genes can be combined to unravel biological mechanisms and gain functional insight into the aetiology of a disease. Our specific experiment involves mining curated pathway databases and a specific text mining method called concept profile matching [31, 32].

In this paper we describe the current state of RO ontologies and tools for the aggregation and annotation of a computational experiment that we developed to elucidate the genetic basis for human metabolic variation.

Methods

We performed our experiment using workflows developed in the open source Taverna Workflow Management System version 2.4 [5]. To improve the understanding of the experiment, we added the following additional resources to the RO, using the RO-enabled myExperiment [33]: 1) the hypothesis or research question (what the experiment was designed to test); 2) a workflow-like sketch of the overall experiment (the overall data flow and workflow aims); 3) one or more workflows encapsulating the computational method; 4) input data (a record of the data that were used to reach the conclusions of an experiment); 5) provenance of workflow runs (the data lineage paths built from the workflow outputs to the originating inputs); 6) the results (a compilation of output data from workflow runs); 7) the conclusions (interpretation of the results from the workflows against the original hypothesis). The resulting RO was then stored in the RO Digital Library [34]. RO completeness was checked from myExperiment with a tool implementing the Minim model [24]. A detailed description of the method follows.

Workflow development

We developed three workflows for interpreting SNP-metabolite associations from a previously published genome-wide association study, using pathways from the KEGG metabolic pathway database [35] and Gene Ontology (GO) [36] biological process associations from text mining of PubMed. To understand an association of a SNP with a metabolite, researchers would like to know the gene in the vicinity of the SNP that is affected by the polymorphism. Then, researchers examine the functional nature of the gene and evaluate if it makes sense given the biochemical characteristics of the metabolite with which it is associated. This typically involves interrogation of biochemical pathway databases and mining existing literature. We would like to evaluate the utility of background knowledge present in the databases and literature in facilitating a biological interpretation of the statistically significant SNP-metabolite pairs. We do this by first determining the genes closest to the SNPs, and then reporting the pathways that these genes participate in. We implemented two main workflows for our experiment. The first one mines the manually curated KEGG database of metabolic pathway and gene associations that are available via the KEGG REST Services [37]. The second workflow mines the text-mining based database of associations between GO biological processes and genes behind the Anni 2.1 tool [31] that are available via the concept profile mining Web services [38]. We also created a third workflow to list all possible concept sets in the concept profile database, to encourage reuse of the concept profile-based workflow for matching against concept sets other than GO biological processes. The workflows were developed following the 10 Best Practices for workflow design [39]. The Best Practices were developed to encourage re-use and prevent workflow decay, and in brief consist of the following steps:

1) Make a sketch workflow to help design the overall data flow and workflow aims, and to identify the tools and data resources required at each stage. The sketch could be created using for example flowchart symbols, or empty beanshells in Taverna.
2) Use modules, i.e. implement all executable components as separate, runnable workflows to make it easier for other scientists to reuse parts of a workflow at a later date.
3) Think about the output. A workflow has the potential to produce masses of data that need to be visualized and managed properly. Also, workflows can be used to integrate and visualise data as well as for analysing it, so one should consider how the results will be presented easily to the user.
4) Provide input and output examples to show the format of input required for the workflow and the type of output that should be produced. This is crucial for the understanding, validation, and maintenance of the workflow.
5) Annotate, i.e. choose meaningful names for the workflow title, inputs, outputs, and for the processes that constitute the workflow as well as for the interconnections between the components, so that annotations are not only a collection of static tags but capture the dynamics of the workflow. Accurately describing what individual services do, what data they consume and produce, and the aims of the workflow are all essential for use and reuse.
6) Make it executable from outside the local environment by for example using remote Web services, or platform independent code/plugins. Workflows are more reusable if they can be executed from anywhere. If there is need to use local services, library or tools, then the workflow should be annotated in order to define its dependencies.
7) Choose services carefully. Some services are more reliable or more stable than others, and examining which are the most popular can assist with this process.
8) Reuse existing workflows by for example searching collaborative platforms such as myExperiment for workflows using the same Web service. If a workflow has been tried, tested and published, then reusing it can save a significant amount of time and resource.
9) Test and validate by defining test cases and implementing validation mechanisms in order to understand the limitations of workflows, and to monitor changes to underlying services.
10) Advertise and maintain by publishing the workflow on for example myExperiment, and performing frequent testing of the workflow and monitoring of the services used. Others can only reuse it if it is accessible and if it is updated when required, due to changes in underlying services.

The RO core model

The RO model [12, 13] aims at capturing the elements that are relevant for interpreting and preserving the results of scientific investigations, including the hypothesis investigated by the scientists, the data artefacts used and generated, as well as the methods and experiments employed during the investigation. As well as these elements, to allow third parties to understand the content of the RO, the RO model caters for annotations that describe the elements encapsulated by the ROs, as well as the RO as a whole. Therefore, two main constructs are at the heart of the RO model, namely aggregation and annotation. The work reported on in this article uses version 0.1 of the RO model, which is documented online [12].

Following myExperiment packs [7], ROs use the ORE model [20] to represent aggregation. Using ORE, an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. Specifically, the RO extends ORE to define three new concepts: i) ro:ResearchObject is a sub-class of ore:Aggregation which represents an aggregation of resources. ii) ro:Resource is a sub-class of ore:AggregatedResource representing a resource that is aggregated within an RO. iii) ro:Manifest is a sub-class of ore:ResourceMap, representing a resource that is used to describe the RO.
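The aggregation construct can be illustrated with a minimal sketch in Python. It represents RDF statements as plain (subject, predicate, object) tuples rather than using an RDF library. The function names, the example URIs and the manifest location are assumptions made for illustration only; the class names and the ore:aggregates and ore:describes properties come from the RO and ORE vocabularies.

```python
# Minimal sketch: an RO as a set of (subject, predicate, object) triples.
# Prefixed names are plain strings; no RDF library is assumed.
RDF_TYPE = "rdf:type"
ORE_AGGREGATES = "ore:aggregates"
ORE_DESCRIBES = "ore:describes"


def build_research_object(ro_uri, resource_uris):
    """Return the triples that declare an RO, its manifest and its resources."""
    manifest_uri = ro_uri + ".ro/manifest.rdf"  # hypothetical manifest location
    triples = {
        (ro_uri, RDF_TYPE, "ro:ResearchObject"),
        (manifest_uri, RDF_TYPE, "ro:Manifest"),
        (manifest_uri, ORE_DESCRIBES, ro_uri),
    }
    for uri in resource_uris:
        triples.add((ro_uri, ORE_AGGREGATES, uri))
        triples.add((uri, RDF_TYPE, "ro:Resource"))
    return triples


def aggregated_resources(triples, ro_uri):
    """List the resources that the given RO aggregates, in sorted order."""
    return sorted(o for s, p, o in triples if s == ro_uri and p == ORE_AGGREGATES)


# Hypothetical example RO with two aggregated files.
ro = "http://example.org/ROs/demo/"
triples = build_research_object(ro, [ro + "hypothesis.txt", ro + "kegg_workflow.t2flow"])
print(aggregated_resources(triples, ro))
```

In a real deployment these statements would be held in the manifest as RDF; the tuple representation is used here only to keep the example self-contained.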

To support the annotation of ROs, their constituent resources, as well as their relationship, we use the Annotation Ontology [21]. Several types of annotations are supported by the Annotation Ontology, e.g., comments, textual annotations (classic tags) and semantic annotations, which relate elements of the ROs to concepts from underlying domain ontologies. We make use of the following Annotation Ontology terms: i) ao:Annotation, which acts as a handle for the annotation. ii) ao:annotatesResource, which represents the resource(s)/RO(s) subject to annotation. iii) ao:body, which describes the target of the annotation. The body of the annotation takes the form of a set of Resource Description Framework (RDF) statements. Note that it is planned for later revisions of the RO model to use the successor of AO, the W3C Community Open Annotation Data Model (OA) [40]. For our purposes, OA annotations follow a very similar structure using oa:Annotation, oa:hasTarget and oa:hasBody.
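As a rough illustration of the annotation construct, the following Python sketch links a target resource to an annotation body, where the body is a small set of statements kept in a separate named graph. The helper names, identifiers and example URI are hypothetical; only the three ao: terms are taken from the text above.

```python
# Minimal sketch: annotations as triples plus a dictionary of named graphs.
def annotate(store, graphs, annotation_id, target_uri, body_triples):
    """Record an ao:Annotation on target_uri whose body is a set of statements."""
    body_id = annotation_id + "/body"  # hypothetical naming scheme for the body graph
    store.add((annotation_id, "rdf:type", "ao:Annotation"))
    store.add((annotation_id, "ao:annotatesResource", target_uri))
    store.add((annotation_id, "ao:body", body_id))
    graphs[body_id] = set(body_triples)
    return body_id


def statements_about(store, graphs, target_uri):
    """Collect the body statements of every annotation that targets target_uri."""
    found = set()
    annotations = {s for s, p, o in store
                   if p == "ao:annotatesResource" and o == target_uri}
    for s, p, o in store:
        if s in annotations and p == "ao:body":
            found |= graphs.get(o, set())
    return found


# Hypothetical example: typing an uploaded image as a sketch.
store, graphs = set(), {}
sketch = "http://example.org/ROs/demo/sketch.png"
annotate(store, graphs, "http://example.org/ROs/demo/ann1", sketch,
         [(sketch, "rdf:type", "roterms:Sketch")])
print(statements_about(store, graphs, sketch))
```

Keeping the body in its own graph mirrors the idea that the annotation acts as a handle while the descriptive statements live in a separate RDF document.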

Support for workflow-centric ROs

A special kind of RO supported by the model is what we call a workflow-centric RO, which, as indicated by the name, is an RO that contains resources that are workflow specifications. The structure of the workflow in ROs is detailed using the wfdesc vocabulary [41], and is defined as a graph in which the nodes refer to steps in the workflow, which we call wfdesc:Process, and the edges represent data flow dependencies, wfdesc:DataLink, each of which is a link between the output and input parameters (wfdesc:Parameter) of the processes that compose the workflow. As well as the description of the workflow, workflow-centric ROs support the specification of the workflow runs, wfprov:WorkflowRun, that are obtained as a result of enacting workflows. A workflow run is specified using the wfprov ontology [42], which captures information about the input used to feed the workflow execution, the output results of the workflow run, as well as the constituent process runs, wfprov:ProcessRun, of the workflow run, which are obtained by invoking the workflow processes, and the inputs and outputs of those process runs.
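The graph structure described above can be sketched in Python as follows. The classes only mirror the roles of wfdesc:Process, wfdesc:Parameter and wfdesc:DataLink; they are not an implementation of the wfdesc vocabulary, and the step and port names in the example are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Parameter:
    """Plays the role of wfdesc:Parameter: a named port on a process."""
    process: str
    name: str


@dataclass(frozen=True)
class DataLink:
    """Plays the role of wfdesc:DataLink: output parameter -> input parameter."""
    source: Parameter
    sink: Parameter


@dataclass
class Workflow:
    """A workflow as a graph of processes (nodes) and data links (edges)."""
    processes: set = field(default_factory=set)
    links: list = field(default_factory=list)

    def connect(self, src_process, src_port, dst_process, dst_port):
        self.processes.update({src_process, dst_process})
        self.links.append(DataLink(Parameter(src_process, src_port),
                                   Parameter(dst_process, dst_port)))

    def upstream(self, process):
        """All processes whose outputs flow, directly or indirectly, into process."""
        seen, stack = set(), [process]
        while stack:
            current = stack.pop()
            for link in self.links:
                if link.sink.process == current and link.source.process not in seen:
                    seen.add(link.source.process)
                    stack.append(link.source.process)
        return seen


# Hypothetical three-step workflow; the step names are not those of the paper.
wf = Workflow()
wf.connect("find_nearest_gene", "gene_id", "get_kegg_pathways", "gene_id")
wf.connect("get_kegg_pathways", "pathways", "format_report", "rows")
print(wf.upstream("format_report"))
```

Walking the data links backwards in this way is the same kind of traversal that a provenance trace supports at the level of individual runs, from outputs back to originating inputs.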

Support for domain-specific information

A key aspect of the RO model design is the freedom to use any vocabulary. This allows for inclusion of very domain-specific information about the RO if that serves the desired purpose of the user. We defined new terms under the name space roterms [43]. These new terms serve two main purposes. They are used to specify annotations that are, to our knowledge, not catered for by existing ontologies, e.g., the classes roterms:Hypothesis and roterms:Conclusion to annotate the hypothesis and conclusions part of an RO, and the property roterms:exampleValue to annotate an example value for a given input or output parameter given as an roterms:WorkflowValue instance. The roterms are also used to specify shortcuts that make the ontology easy to use and more accessible. For example, roterms:inputSelected associates a wfdesc:WorkflowDefinition to an ro:Resource to state that a file is meant to be used with a given workflow definition, without specifying at which input port or in which workflow run.
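To show how such a shortcut property might be used, the sketch below stores roterms:inputSelected statements as plain tuples and looks them up in both directions. The function names and URIs are hypothetical; the direction of the property (workflow definition to resource) follows the description above.

```python
# Minimal sketch: querying the roterms:inputSelected shortcut over plain triples.
INPUT_SELECTED = "roterms:inputSelected"


def inputs_for(triples, workflow_uri):
    """Resources marked as intended input for the given workflow definition."""
    return sorted(o for s, p, o in triples
                  if s == workflow_uri and p == INPUT_SELECTED)


def workflows_using(triples, resource_uri):
    """Workflow definitions for which the given resource was selected as input."""
    return sorted(s for s, p, o in triples
                  if p == INPUT_SELECTED and o == resource_uri)


# Hypothetical RO in which two workflows share one example input file.
base = "http://example.org/ROs/demo/"
links = {
    (base + "kegg_workflow.t2flow", INPUT_SELECTED, base + "example_snps.txt"),
    (base + "concept_profile_workflow.t2flow", INPUT_SELECTED, base + "example_snps.txt"),
}
print(workflows_using(links, base + "example_snps.txt"))
```

Because the shortcut does not name an input port or a workflow run, a lookup like this answers which file belongs with which workflow, but not how the file was bound during a particular execution.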

Minim model for checklist evaluation

When building an RO in myExperiment users are provided with a mechanism of quality assurance by our so-called checklist evaluation tool, which is built upon the Minim checklist ontology [23, 44] and defined using Web Ontology Language. Its basic function is to assess that all required information and descriptions about the aggregated resources are present and complete. Additionally, according to explicit requirements defined in a checklist, the tool can also assess the accessibility of those resources aggregated in an RO, in order to increase trust in the understanding of the RO. The Minim model has four key components, as illustrated by Figure 1: 1) a Constraint, which associates a model (checklist) to use with an RO, for a specific assessment purpose, e.g. reviewing whether an RO contains sufficient information before being shared; 2) a Model, which enumerates the set of requirements to be considered, which may be declared at levels of MUST, SHOULD or MAY be satisfied for the model as a whole; 3) a Requirement, which is the key part for expressing the concrete quality requirements to an RO, for example, the presence of certain information about an experiment, or liveness (accessibility) of a data server; 4) a Rule, which can be a SoftwareRequirementRule, to specify the software to be present in the operating environment, a ContentMatchRequirementRule, to specify the presence of a certain pattern in the assessed data, or a DataRequirementRule, to specify a data resource to be aggregated in an RO.

RO digital library

While myExperiment acts mainly as a front-end to users, the RO Digital Library [34] acts as a back-end, with two complementary storage components: a digital repository to keep the content, and a triple store to manage the meta-data content. The ROs in the repository can be accessed via a RESTful API [45] or via a public SPARQL endpoint [46]. All the ROs created in myExperiment are also submitted to the RO Digital Library.

Workflow-centric RO creation process

Below we describe the steps that we conducted when creating the RO for our case study in an “RO-enabled” version of myExperiment [33]. The populated RO is intended to contain all the information required to re-run the experiment, or understand the results presented, or both.

Creating an RO

The action of creating an RO consists of generating the container for the items that will be aggregated, and getting a resolvable identifier for it. In myExperiment the action of creating an RO is similar to creating a pack. We filled in a title and description of the RO at the point of creation and got a confirmation that the RO had been created and had been assigned a resolvable identifier in the RO Digital Library (Figure 2).

Adding the experiment sketch

Using a popular office presentation tool, we made an experiment sketch and saved it as a PNG image. We then uploaded the image to the pack, selecting the type “Sketch”. As a result, the image gets stored in the Digital Library and aggregated in the RO. In addition, an annotation was added to the RO to specify that the image is of type “Sketch”. A miniature version of the sketch is shown within the myExperiment pack (Figure 3).

Adding the hypothesis

To specify the hypothesis, we created a text file that describes the hypothesis, and then uploaded it to the pack as type “Hypothesis”. The file gets stored in the Digital Library and aggregated in the RO, this time annotated to be of type “Hypothesis”.

Adding workflows

We saved the workflow definitions to files and uploaded them to the pack as type “Workflow”. MyExperiment then automatically performed a workflow-to-RDF transformation in order to extract the workflow structure according to the RO model, which includes user descriptions and metadata created within the Taverna workbench. The descriptions and the extracted structure get stored in the RO Digital Library and associated with the workflow files as annotations.

Adding the workflow input file

The data values were stored in files that were then uploaded into the pack as “Example inputs”. Such files get stored in the RO Digital Library and aggregated in the RO, annotated to be of type “Example inputs”.

Adding the workflow provenance

Using the Taverna-Prov [47] extension to Taverna, we exported the workflow run provenance to a file that we uploaded to the pack as type “Workflow run”. Similar to other resources, the provenance file gets stored in the digital library with the type “Workflow run”, however as the file is in the form of RDF according to the wfprov [42] and W3C PROV-O [22] ontologies, it is also integrated into the RDF store of the digital library and available for later querying.

Adding the results

We made a compilation of the different workflow outputs to a result file in table format, uploaded to the pack as type “Results”. The file gets stored in the digital library and aggregated in the RO, annotated to be of the type “Results”.

Adding the conclusions

To specify the conclusions, we created a text file that describes the conclusions, and then uploaded it to the pack as type “Conclusions”. The file gets stored in the digital library and aggregated in the RO, annotated to be of type “Conclusions”.

Intermediate step: checklist evaluation

At this point we checked how far we were from satisfying the Minim model, and were informed by the tool that the RO now fully satisfies the checklist (Figure 4).

Annotating and linking the resources

We linked the example input file to the workflows that used the file by the property “Input_selected” (Figure 5). In this particular case, both workflows have the same inputs but they need to be configured in different ways. This is described in the workflow description field in Taverna.

Results

The RO for our experiment is the container for the items that we wished to aggregate. In terms of RDF, we first instantiated an ro:ResearchObject in an RO-enabled version of myExperiment [33]. We thereby obtained a unique and resolvable Uniform Resource Identifier (URI) from the RO store that underlies this version of myExperiment. In our experimental setup this was http://sandbox.wf4ever-project.org/rodl/ROs/Pack405/. It is accessible from myExperiment [48]. Each of the subsequent items in the RO was aggregated as an ro:Resource, indicating that the item is considered a constituent member of the RO from the point of view of the scientist (the creator of the RO).

Aggregated resources

We aggregated the following items: 1) the hypothesis (roterms:Hypothesis): we hypothesized that SNPs can be functionally annotated using metabolic pathway information complemented by text mining, and that this will lead to formulating new hypotheses regarding the role of genomic variation in biological processes; 2) the sketch (roterms:Sketch) shows that our experiment follows two paths to interpret SNP data: matching with concept profiles and matching with Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Figure 3); 3) the workflows (wfdesc:Workflow): Figure 6 shows the workflow diagram for the KEGG workflow and Figure 7 shows the workflow diagram for the concept profile matching workflow. In Taverna, we aimed to provide sufficient annotation of the inputs, outputs and the functions of each part of the workflow to ensure a clear interpretation and to ensure that scientists know how to replay the workflows using the same input data, or re-run them with their own data. We provided textual descriptions in Taverna of each step of the workflow, in particular to indicate their purpose within the workflow (Figure 8); 4) the input data (roterms:exampleValue) that we aggregated in our RO was a list of example SNPs derived from the chosen GWAS [28]; 5) the workflow run provenance (roterms:WorkflowRunBundle): a ZIP archive that contains the intermediate values of the workflow run, together with its provenance trace expressed using wfprov:WorkflowRun and subsequent terms from the wfprov ontology. We thus stored process information from the input of the workflow execution to its output results, including the information for each constituent process run in the workflow run, modelled as wfprov:ProcessRun. The run data comprise 3 ZIP files containing 2090 intermediate values as separate files totalling 9.7 MiB, in addition to 5 MiB of provenance traces; 6) the results (roterms:Result) were compiled from the different workflow outputs to one results file (see the result document in the RO [49], Additional file 1). For 15 SNPs it lists the associated gene name, the biological annotation from the GWAS publication, the associated KEGG pathway, and the most strongly associated biological process according to concept profile matching. Our workflows were able to compute a biological annotation from KEGG for 10 out of 15 SNPs and for all 15 from mining PubMed. All KEGG annotations and most text mining annotations corresponded to the annotations by Illig et al. [28]. An important result of the text mining workflow was the SNP-annotation “rs7156144 - stimulation of tumor necrosis factor production”, which represents a hypothetical relation that to our knowledge was not reported before; 7) the conclusions (roterms:Conclusion): we concluded that our KEGG and text mining workflows were successful in retrieving biological annotations for significant SNPs from a GWAS experiment, and predicting novel annotations.

As an example of our instantiated RO, Figure 9 provides a simplified view of the RDF graph that aggregates and annotates the KEGG mining workflow. It shows the result of uploading our Taverna workflow to myExperiment, which initiated an automatic transformation from a Taverna 2 t2flow file to a Taverna 3 workflow bundle, while extracting the workflow structure and user descriptions in terms of the wfdesc model [41]. The resulting RDF document was aggregated in the RO and used as the annotation body of an ao:Annotation on the workflow, thus creating a link between the aggregated workflow file and its description in RDF. The Annotation Ontology uses named graphs for semantic annotation bodies. In the downloadable ZIP archive of an RO each named graph is available as a separate RDF document, which can be useful in current RDF triple stores that do not yet fully support named graphs. The other workflows were aggregated and annotated in the same way. The RO model further uses common Dublin Core vocabulary terms [50] for basic metadata such as creator, title, and description.

In some cases we manually inserted specified relations between the RO resources via the myExperiment user interface. An example is the link between input data and the appropriate workflow for cases when an RO has multiple workflows and multiple example inputs. In our case, both workflows have the same inputs, but they need to be configured in different ways. This was described in the workflow description field in Taverna which becomes available from an annotation body in the workflow upload process.

Checking for completeness of an RO: application of the Minim model

We also applied Semantic Web technology for checking the completeness of our RO. We implemented a checklist for the items that we consider essential or desirable for understanding a workflow-based experiment by annotating the corresponding parts of the RO model with the appropriate term from the Minim vocabulary (Table 1). Thus, some parts were annotated as “MUST have” with the property minim:hasMustRequirement (e.g. at least one workflow definition), and others as “SHOULD have” with the property minim:hasShouldRequirement (e.g. the overall sketch of the experiment). The complete checklist document can be found online in RDF format [51] and in a format based on the spreadsheet description of the workflow [52]. We subsequently used a checklist service that evaluates whether an RO is complete by executing SPARQL queries on the Minim mappings. The overall result is a summary of the requirement levels associated with the individual items; e.g. a missing MUST requirement is a more serious omission than a missing SHOULD (or COULD) requirement. We justified the less strict requirements for some items to accommodate cases when an RO is used to publish a method as such. We found that treating the requirement levels as mutually exclusive (hence not subproperties) simplifies the implementation of checklist evaluation, and in particular the generation of results when a checklist item is not satisfied.
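The evaluation logic described above can be illustrated with a minimal sketch. This is not the checklist service itself: it replaces the SPARQL queries with a simple lookup over an in-memory list of triples, and the resource names, the `ChecklistItem` structure and the `evaluate` function are hypothetical. The class names `wfdesc:Workflow` and `roterms:Sketch` are used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Requirement levels, ordered from most to least serious when unmet.
LEVELS = ("MUST", "SHOULD", "COULD")


@dataclass(frozen=True)
class ChecklistItem:
    level: str      # exactly one of LEVELS; levels are mutually exclusive
    label: str      # human-readable description of the requirement
    rdf_type: str   # hypothetical: class that some RO resource must have


def evaluate(triples: Iterable[Triple], checklist: List[ChecklistItem]) -> Dict:
    """Report unmet items per level and the most serious unmet level."""
    present_types = {o for (_s, p, o) in triples if p == "rdf:type"}
    missing: Dict[str, List[str]] = {level: [] for level in LEVELS}
    for item in checklist:
        if item.rdf_type not in present_types:
            missing[item.level].append(item.label)
    overall = "complete"
    for level in LEVELS:
        if missing[level]:
            overall = "missing " + level
            break
    return {"overall": overall, "missing": missing}


# Illustrative RO content: a workflow and a hypothesis, but no sketch.
ro_triples = [
    ("ro:kegg_workflow", "rdf:type", "wfdesc:Workflow"),
    ("ro:hypothesis.txt", "rdf:type", "roterms:Hypothesis"),
]

checklist = [
    ChecklistItem("MUST", "at least one workflow definition", "wfdesc:Workflow"),
    ChecklistItem("SHOULD", "overall sketch of the experiment", "roterms:Sketch"),
]

report = evaluate(ro_triples, checklist)
print(report["overall"])
print(report["missing"]["SHOULD"])
```

Because each item carries exactly one level, an unmet item can be reported directly under that level without having to resolve overlapping requirement properties.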

Table 1 RO items checklist


Discussion

In this paper we explored the application of the Semantic Web encoded RO model to provide a container data model for preserving sufficient information for researchers to understand a computational experiment. We found that the model indeed allowed us to aggregate the necessary material together with sufficient annotation (both for machines and humans). Moreover, mapping of selected RO model artefacts to the Minim vocabulary allowed us to check if the RO was complete according to our own predefined criteria. The checklist service can be configured to accommodate different criteria. Research groups may have different views on what is essential, but also libraries or publishers may define their own standards, enabling partial automation of the process of checking a submission against specific instructions to authors. Furthermore, the service can be run routinely to check for workflow decay, in particular decay related to references that go missing.

In using the RO model, we sought to meet requirements for sharing, reuse and repurposing, as well as interoperability and reproducibility. This fits with current trends to enhance reproducibility and transparency of science (e.g. see [53–55]). Reproducibility in computational science has been defined as a spectrum [55], where a computational experiment that is described only by a publication is not seen as reproducible, while adding code, data, and finally the linked data and execution data will move the experiment towards full replication. Adhering to this definition, our RO-enabled computational experiment comes close to fulfilling the ultimate gold standard of full replication, but falls short because it has not been analyzed using independently collected data. The benefit offered by the RO in terms of reproducibility is that it provides a context (RO) within which an evaluation of reproducibility can be performed. It does this by providing an enumerated and closed set of resources that are part of the experiment concerned, and by providing descriptive metadata (annotations) that may be specific to that context. This is not necessarily the complete solution to reproducible research, but at least an incremental step in that direction.

We have used RDF as the underlying data model for exchanging ROs. One of the advantages is the ability to query the data, which becomes clear when we want to answer questions about the experiment, such as: 1) Which conclusions were drawn from a given workflow?; 2) Which workflow (run) supports a particular conclusion and which datasets did it use as inputs?; 3) Which different workflows used the same dataset X as input?; 4) Who can be credited for creating workflows that use GWAS data? The answers to the first two questions can readily be found using a simple SPARQL [56] query. Figure 10 shows the SPARQL query and the results as returned by the SPARQL endpoint of the RO Digital Library. Note that in our case we got two result rows, one for each of the workflows that were used to confirm the hypothesis. We emphasize that queries could also be constructed to answer more elaborate questions such as questions 3 and 4. Without adding any complexity to the query or the infrastructure, it is possible to query over the entire repository of research objects. This effectively integrates all meta-data of any workflow-based experiment that was uploaded to the RO Digital Library via myExperiment. When more ROs have become available that use the same annotations as described in this paper, then we can start sharing queries that can act as templates. We did not explore further formalization in terms of rejecting or accepting hypotheses, since formulating such a hypothesis model properly would be very domain specific, such as current efforts in neuromedicine [57]. However, the RO model does not exclude the possibility to do so.
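To make question 3 concrete, the following sketch shows the kind of graph pattern such a query would express. It is a plain-Python stand-in, not the SPARQL query of Figure 10 and not a call to the RO Digital Library endpoint. The resource identifiers are invented, and the property names are modelled on the wfprov vocabulary but should be read as assumptions.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Invented run metadata; identifiers and property usage are illustrative only.
triples: List[Triple] = [
    ("run:kegg-1", "wfprov:describedByWorkflow", "wf:kegg-mining"),
    ("run:kegg-1", "wfprov:usedInput", "data:snp-list"),
    ("run:text-1", "wfprov:describedByWorkflow", "wf:text-mining"),
    ("run:text-1", "wfprov:usedInput", "data:snp-list"),
    ("run:other-1", "wfprov:describedByWorkflow", "wf:other"),
    ("run:other-1", "wfprov:usedInput", "data:other-input"),
]


def match(data: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in data:
        if (s is None or t[0] == s) and (p is None or t[1] == p) \
                and (o is None or t[2] == o):
            yield t


def workflows_using(data: List[Triple], dataset: str) -> List[str]:
    """Find workflows with at least one run that used `dataset` as input."""
    # Step 1: runs that used the dataset as input.
    runs = {s for (s, _p, _o) in match(data, p="wfprov:usedInput", o=dataset)}
    # Step 2: workflows that describe those runs.
    workflows = {o for (s, _p, o) in match(data, p="wfprov:describedByWorkflow")
                 if s in runs}
    return sorted(workflows)


print(workflows_using(triples, "data:snp-list"))
```

In SPARQL the two steps would collapse into a single basic graph pattern joined on the run variable, which is why the same query can be applied unchanged across a repository of ROs that share these annotations.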

Applying the RO model in genomic working environments

An important criterion for our evaluation of the RO model and tools is that it should support researchers in preparing their digital methods and results for publication. We have shown that the RO model can be applied in an existing framework for sharing computational workflows (myExperiment). We used Taverna to create our workflows, and the wf4ever toolkit [58], including dLibra [59] that was extended with a triple store as a back end to store the ROs. The RO features of the test version of myExperiment that we used are currently under development for migration to the production version of myExperiment [60]. For a user, creating an RO in the test version of myExperiment is no different from creating a pack; the creation of RDF objects is completely hidden under the hood. The difference lies in the support of the RO model, which allows the user to add data associated with a computational experiment in a structured way (a sketch representing the experimental setup, the hypothesis document, result files, etc.), and metadata in the form of annotations. Every piece of data in an RO can be annotated, either in a structured or machine-generated way like the automatic annotation of a wfdesc description of a workflow as provided by the workflow-to-RDF transformation service, or manually by the user at the time of resource upload, such as the annotation of an experiment overview as “Sketch”. Since RO descriptions are currently not a prerequisite for publishing workflow results in a journal, we hope that this support and streamlining of the annotation process will act as an incentive for scientists to start using the RO technology.

The representation of an RO in myExperiment as presented in this paper should be seen as a proof-of-concept. Crucial elements of a computational experiment are handled, but there is room for improvement. For example, the hypothesis and conclusions are at the moment only shown as downloadable text files and the content and provenance of a workflow run are not shown to the user. We found that more tooling is needed to make practical use of the provenance trace. The trace is detailed and focuses on data lineage rather than on the biological meaning of the recorded steps. Nevertheless, we regard this raw workflow data as highly valuable as the true record of what exactly was executed. It allows introspection of the data lineage, such as which service was invoked with exactly which data. By providing this proof-of-concept and the RO model as a reference model, we hope to stimulate developers of other genomic working environments such as Galaxy [6] and Genome Space [61] to start implementing the RO model as well, thus enabling scientists to share their investigation results as a complete knowledge package. Similarly, workflow systems use different workflow languages [62, 63], and by presenting the workflow-to-RDF transformation service that handles the t2flow serialization format to transform a workflow to an RO, we hope to encourage systems that use other workflow languages to develop similar services to transform their workflows to ROs. This would allow for a higher-level understanding of workflow-based experiments regardless of the type of workflow system used.

It should be noted that although our ROs fully capture the individual data items of individual steps within workflow runs, this approach is not applicable to all scientific workflows. In fact, we have since further developed the provenance support for Taverna so that larger pieces of data are only recorded as URI references and not bundled within the ZIP file. The Taverna workflow system already supports working with such references; however, many bioinformatics Web services still only support working directly with values. When dealing with references, the workflow run data only capture the URI and its metadata. Full access to the run data therefore also depends on the continued availability (or mirroring) of those referenced resources, and their consistency would later need to be verified against metadata such as byte size and Secure Hash Algorithm checksums.
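A consistency check of this kind can be sketched as follows. The sketch assumes that the content of a referenced resource has already been retrieved as bytes and that its byte size and a SHA-256 digest were recorded at run time. The choice of SHA-256, the metadata layout and the function names `record_metadata` and `is_consistent` are assumptions made for illustration; they do not describe how Taverna stores this metadata.

```python
import hashlib
from typing import Dict


def record_metadata(content: bytes) -> Dict[str, object]:
    """Capture the metadata that would be stored next to the URI reference."""
    return {
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def is_consistent(content: bytes, metadata: Dict[str, object]) -> bool:
    """Check later-retrieved content against the recorded size and checksum."""
    # The size comparison is cheap and catches truncated or extended content.
    if len(content) != metadata["size"]:
        return False
    # The checksum catches changes that leave the size unchanged.
    return hashlib.sha256(content).hexdigest() == metadata["sha256"]


# Illustrative content standing in for a referenced workflow run value.
original = b"example run value\n"
metadata = record_metadata(original)

print(is_consistent(original, metadata))
print(is_consistent(b"Example run value\n", metadata))
```

The second call uses content of the same length with one changed byte, which passes the size comparison and is rejected only by the checksum.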

Generalization to other domains

We acknowledge that apart from enabling the structured aggregation and annotation of digital ROs technically, scientists appreciate guidelines and Best Practices for producing high quality ROs. In fact, the minimal requirements for a complete RO that we implemented via the Minim model were inspired by the 10 Best Practices that we defined for creating workflows [39]. An RO may be evaluated using different checklists for different purposes. A checklist description is published as linked data, and may be included in the RO, though we anticipate more common use will be for it to be published separately in a community web site. In our work to date, we have used checklist definitions published via GitHub (e.g. [64]), and are looking to create a collection of example checklist definitions to seed creation of checklists for different domains or purposes [65]. We envision that instructions to authors of ROs may differ between research communities, and publishers who wish to adopt RO technology for digital submissions may develop their own ‘Instructions to Authors’ for ROs. This could be implemented by different mappings of the Minim model.

Related work

The RO model was implemented as a Semantic Web model to provide a general, domain-agnostic reference that can be extended by domain specific ontologies. For instance, while the RO model offers terms pertaining to experimental science such as “hypothesis” and “conclusion”, extensions to existing models that also cover this area and are already in use in the life science domain could be considered. It is beyond the scope of this article to exhaustively review related ontologies and associated tools, but we wish to mention six that in our view are prime candidates to augment the RO family of ontologies and tools. The first is the Ontology for Biomedical Investigations (OBI) that aims to represent all phases of experimental processes, such as study designs, protocols, instrumentation, biological material, collected data and analyses performed on that data [66]. OBI is used for the ontological representation of the results of the Investigation-Study-Assay (ISA) metadata tools [67], which are next on our list of candidates. ISA, developed by the ISA commons community, facilitates curation, management, and reuse of omics datasets in a variety of life science domains [68]. It puts spreadsheets at the heart of its tooling, making it highly popular for study capture in the omics domain [69]. The third candidate is the ontology for scientific experiments EXPO [70]. EXPO is defined by OWL-DL axioms and is grounded in upper ontologies. Its coverage of experiment terms is good, but we are unsure about its uptake by the community. Perhaps unfortunately for a number of good ontologies, we consider community uptake an important criterion for interoperability. Four and five on our list relate to the annotation of Web Services (or bioinformatics operations in general): the EMBRACE Data and Methods (EDAM) ontology encompasses over 2200 terms for annotating tool or workflow functions, types of data and identifiers, application domains and data formats [71].
It is developed and maintained by the European Bioinformatics Institute and has been adopted for annotation of, for instance, the European Molecular Biology Open Software Suite. The myGrid-BioMoby ontology served as a starting point for the development of EDAM. This will facilitate the adoption of EDAM by, for instance, BioCatalogue.org and service-oriented tools such as Taverna, which would further broaden its user base and thereby its use for interoperability. The Semantic Automated Discovery and Integration (SADI) framework [72] takes semantic annotation of Web Services one step further. A SADI Web Service describes itself in terms of OWL classes, and produces and consumes instances of OWL classes. This enables instant annotation in a machine readable format when a workflow is built from SADI services. In addition, via a SADI registry suggestions can be made about which services to connect to which. SADI has clear advantages as an annotation framework. However, not all bioinformatics services are available as SADI services, and the conversion is not trivial without training in Semantic Web modelling. Therefore, SADI and RO frameworks could be strongly complementary for workflows that use a heterogeneous mix of service types. This would be further facilitated when both are linked to common ontologies such as EDAM. Finally, we highlight the recent development of models for microattribution and nanopublication that aim to provide a means of getting credit for individual assertions and making these available in a machine readable format [73–75]. Taking nanopublications as an example, we could “nanopublish” specific results from our experiment, such as the text mining-based association that we found between the SNP “rs7156144” and the biological process “stimulation of tumor necrosis factor production”.
In addition to an assertion, a nanopublication consists of provenance meta-data (to ensure trust in the assertion) and publication information (providing attribution to authors and curators). Nanopublication and RO complement each other in two ways. On the one hand, nanopublications can be used to publish and expose valuable results from workflows and can be included in the RO aggregate. On the other hand, an RO could be referenced as part of the provenance of a nanopublication, serving as a record of the method that led to the assertion of the nanopublication. Similar to the nanopublication and microattribution models, the Biotea and Elsevier Smart Content Initiative data models also aim to model scientific results, but are focused on encapsulating a collection of information that is related to the results reported in publications [76, 77]. The relationship between an RO and these datasets is not much different from that between an RO and a nanopublication statement. An RO can be referenced by, e.g., the Biotea dataset by its URI, which can provide detailed experimental information or provenance information about the results described by the Biotea dataset. Conversely, an RO can also reference a Biotea dataset or an Elsevier linked dataset.

Summarizing, the RO model provides a general framework with terms for aggregating and annotating the components of digital research experiments, by which it can complement related frameworks that are already used in the life science domain such as EXPO, OBI, ISA, EDAM, SADI and nanopublication. We observe that these models are partly complementary and partly overlapping in scope. Therefore, we encourage collaboration towards the development of complementary frameworks. For instance, we initiated an investigation of the combination of ISA, RO, and Nanopublication as a basis for general guidelines for publishing digital research artefacts (Manuscript in preparation).

Uptake by the research community

Beyond the RO presented in this paper, the RO model has been used to generate ROs within the domains of musicology [78] and astronomy using AstroTaverna [79]. In addition, we recently explored how an RO could be referenced as part of the provenance of nanopublications of genes that are differentially expressed in Huntington’s Disease (HD) with certain genomic regions [80, 81]. The results from the in silico analysis of the differentially expressed genes were obtained from a Taverna data integration workflow and the RO itself was stored in the Digital Library. Using the PROV-O ontology, the nanopublication provenance was modelled to link to the workflow description in the RO. Since the RO was mostly automatically generated by the procedure described in this paper, the nanopublication refers to detailed provenance information without requiring additional modelling effort. To encourage further uptake by the research community we have developed the Web resource ResearchObject.org [82]. ResearchObject.org lists example ROs [83], presents the ongoing activities of the open RO community, and gathers knowledge about related developments and adoptions.

Conclusions

Applying the workflow-centric RO model and associated models such as Minim provides a digital method to increase the understanding of bioinformatics experiments. Crucial meta-data related to the experiment is preserved in a Digital Library by structured aggregation and annotation of hypothesis, input data, workflows, workflow runs, results, and conclusions. The Semantic Web representation provides a reference model for life scientists who perform computational analyses and for systems that support this, and can complement related annotation frameworks that are already in use in the life science domain.

Consent

Written informed consent was obtained from Kristina M Hettne to publish her picture in relation to the myExperiment pack "Interpreting GWAS results with pathways and text mining".

References

Chen H, Yu T, Chen JY: Semantic Web meets Integrative Biology: a survey. Brief Bioinform. 2012, 14: 109-125.
Sneddon TP, Li P, Edmunds SC: GigaDB: announcing the GigaScience database. Gigascience. 2012, 1: 11-10.1186/2047-217X-1-11.
Ghosh S, Matsuoka Y, Asai Y, Hsin K-Y, Kitano H: Software for systems biology: from tools to integrated platforms. Nat Rev Genet. 2011, 12: 821-832.
Beaulah SA, Correll MA, Munro REJ, Sheldon JG: Addressing informatics challenges in Translational Research with workflow technology. Drug Discov Today. 2008, 13: 771-777. 10.1016/j.drudis.2008.06.005.
Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P, Bhagat J, Belhajjame K, Bacall F, Hardisty A, de la Nieva Hidalga A, Balcazar Vargas MP, Sufi S, Goble C: The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res. 2013, 41 (Web Server issue): W557-W561.
Goecks J, Nekrutenko A, Taylor J, Galaxy Team T: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11: R86-10.1186/gb-2010-11-8-r86.
Goble CA, Bhagat J, Aleksejevs S, Cruickshank D, Michaelides D, Newman D, Borkum M, Bechhofer S, Roos M, Li P, De Roure D: myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res. 2010, 38 (Web Server): W677-W682. 10.1093/nar/gkq429.
Mates P, Santos E, Freire J, Silva CT: CrowdLabs: Social Analysis and Visualization for the Sciences. Sci Stat Database Manag. Volume 6809. Edited by: Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Pandu Rangan C, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Bayard Cushing J, French J, Bowers S. 2011, Berlin, Heidelberg: Springer Berlin Heidelberg, 555-564.
Zhao J, Gomez-Perez JM, Belhajjame K, Klyne G, Garcia-Cuesta E, Garrido A, Hettne K, Roos M, De Roure D, Goble C: Why workflows break - Understanding and combating decay in Taverna workflows. 2012 IEEE 8th International Conference on E-Science (e-Science). 2012, 1-9. doi: dx.doi.org/10.1109/eScience.2012.6404482
Rebholz-Schuhmann D, Grabmüller C, Kavaliauskas S, Croset S, Woollard P, Backofen R, Filsell W, Clark D: A case study: semantic integration of gene-disease associations for type 2 diabetes mellitus from literature and biomedical data resources. Drug Discov Today. 2013, 7: 882-889.
Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, Willighagen EL, Evelo CT, Blomberg N, Ecker G, Goble C, Mons B: Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today. 2012, 17: 1188-1198. 10.1016/j.drudis.2012.05.016.
Wf4Ever Research Object model. http://wf4ever.github.io/ro
Belhajjame K, Corcho O, Garijo D, Zhao J, Missier P, Newman DR, Palma R, Bechhofer S, Garcia Cuesta E, Gomez-Perez JM, Klyne G, Page K, Roos M, Enrique Ruiz J, Soiland-Reyes S, Verdes-Montenegro L, De Roure D, Goble C: Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. Proc 2nd Work Semant Publ. Volume 903. 2012, Hersonissos, Crete, Greece: CEUR Workshop Proceedings
Bechhofer S, De Roure D, Gamble M, Goble CA, Buchan I: Research objects: Towards exchange and reuse of digital knowledge. 2010, Raleigh: In Futur Web Collab Sci
Bechhofer S, Buchan I, De Roure D, Missier P, Ainsworth J, Bhagat J, Couch P, Cruickshank D, Delderfield M, Dunlop I, Gamble M, Michaelides D, Owen S, Newman D, Sufi S, Goble C: Why linked data is not enough for scientists. Futur Gener Comput Syst. 2013, 29: 599-611. 10.1016/j.future.2011.08.004.
De Roure D, Missier P, Manuel J, Hettne K, Klyne G, Goble C: Towards the Preservation of Scientific Workflows. iPress 2011
Roos M, Marshall MS, Gibson AP, Schuemie M, Meij E, Katrenko S, van Hage WR, Krommydas K, Adriaans PW: Structuring and extracting knowledge for the support of hypothesis generation in molecular biology. BMC Bioinformatics. 2009, 10 Suppl 1 (Suppl 10): S9-
Livingston KM, Bada M, Hunter LE, Verspoor K: Representing annotation compositionality and provenance for the Semantic Web. J Biomed Semantics. 2013, 4: 38-10.1186/2041-1480-4-38.
Ciccarese P, Soiland-Reyes S, Belhajjame K, Gray AJ, Goble C, Clark T: PAV ontology: provenance, authoring and versioning. J Biomed Semantics. 2013, 4: 37-10.1186/2041-1480-4-37.
Object Exchange and Reuse (ORE) model. http://www.openarchives.org/ore/1.0/primer.html
Ciccarese P, Ocana M, Garcia Castro LJ, Das S, Clark T: An open annotation ontology for science on web 3.0. J Biomed Semantics. 2011, 2 (Suppl 2): S4-10.1186/2041-1480-2-S2-S4.
Missier P, Belhajjame K, Cheney J: The W3C PROV family of specifications for modelling provenance metadata. Proc 16th Int Conf Extending Database Technol - EDBT ’13. 2013, New York, New York, USA: ACM Press, 773-
Zhao J, Klyne G, Gamble M, Goble CA: A Checklist-Based Approach for Quality Assessment of Scientific Information. Proceedings of the Third Linked Science Workshop co-located at the International Semantic Web Conference. 2013, Sydney, Australia
Minim checklist service. https://github.com/wf4ever/ro-manager/blob/master/Minim/Minim-description.md
Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol. 2008, 26: 889-896. 10.1038/nbt.1411.
Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, de Bakker PI, Abecasis GR, Almgren P, Andersen G, Ardlie K, Boström KB, Bergman RN, Bonnycastle LL, Borch-Johnsen K, Burtt NP, Chen H, Chines PS, Daly MJ, Deodhar P, Ding CJ, Doney AS, Duren WL, Elliott KS, Erdos MR, Frayling TM, Freathy RM, Gianniny L, Grallert H, Grarup N: Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet. 2008, 40: 638-645. 10.1038/ng.120.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9: 356-369. 10.1038/nrg2344.
Illig T, Gieger C, Zhai G, Römisch-Margl W, Wang-Sattler R, Prehn C, Altmaier E, Kastenmüller G, Kato BS, Mewes H-W, Meitinger T, de Angelis MH, Kronenberg F, Soranzo N, Wichmann H-E, Spector TD, Adamski J, Suhre K: A genome-wide perspective of genetic variation in human metabolism. Nat Genet. 2010, 42: 137-141. 10.1038/ng.507.
Gieger C, Geistlinger L, Altmaier E, de Angelis M, Kronenberg F, Meitinger T, Mewes H-W, Wichmann H-E, Weinberger KM, Adamski J, Illig T, Suhre K: Genetics meets metabolomics: a genome-wide association study of metabolite profiles in human serum. PLoS Genet. 2008, 4: e1000282-10.1371/journal.pgen.1000282.
Suhre K, Shin SY, Petersen AK, Mohney RP, Meredith D, Wägele B, Altmaier E, Deloukas P, Erdmann J, Grundberg E, Hammond CJ, de Angelis MH, Kastenmüller G, Köttgen A, Kronenberg F, Mangino M, Meisinger C, Meitinger T, Mewes HW, Milburn MV, Prehn C, Raffler J, Ried JS, Römisch-Margl W, Samani NJ, Small KS, Wichmann HE, Zhai G, Illig T, CARDIoGRAM: Human metabolic individuality in biomedical and pharmaceutical research. Nature. 2011, 477: 54-60. 10.1038/nature10354.
Jelier R, Schuemie MJ, Veldhoven A, Dorssers LCJ, Jenster G, Kors JA: Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol. 2008, 9: R96-10.1186/gb-2008-9-6-r96.
Hettne KM, Boorsma A, van Dartel DA, Goeman JJ, de Jong E, Piersma AH, Stierum RH, Kleinjans JC, Kors JA: Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data. BMC Med Genomics. 2013, 6: 2-10.1186/1755-8794-6-2.
myExperiment alpha. http://alpha.myexperiment.org
Palma R, Corcho O, Holubowicz P, Pérez S, Page K, Mazurek C: Digital libraries for the preservation of research methods and associated artifacts. Proc 1st Int Work Digit Preserv Res Methods Artefacts - DPRMA ’13. 2013, New York, New York, USA: ACM Press, 8-15.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40 (Database issue): D109-D114.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
KEGG REST services. http://www.kegg.jp/kegg/rest/keggapi.html
Concept Profile Mining Web services. https://www.biocatalogue.org/services/3559
Hettne KM, Wolstencroft K, Belhajjame K, Goble CA, Mina E, Dharuri H, De Roure D, Verdes-Montenegro L, Garrido J, Roos M: Best Practices for Workflow Design: How to Prevent Workflow Decay. Proc 5th Int Work Semant Web Appl Tools Life Sci Paris, Fr Novemb 28-30, 2012, Volume 952. 2012, Paris, France: CEUR-WS.org, [CEUR Workshop Proceedings]
Sanderson R, Ciccarese P, Van de Sompel H: Designing the W3C open annotation data model. Proc 5th Annu ACM Web Sci Conf - WebSci ’13. 2013, New York, New York, USA: ACM Press, 366-375.
wfdesc vocabulary. https://github.com/wf4ever/ro/blob/master/wfdesc.owl
wfprov ontology. http://purl.org/wf4ever/wfprov#
RO terms vocabulary. http://purl.org/wf4ever/roterms
Minim checklist ontology. http://purl.org/minim/
Research Object Digital Library Restful API. http://www.wf4ever-project.org/wiki/display/docs/RO+API+6
Research Object Digital Library SPARQL endpoint. http://sandbox.wf4ever-project.org/portal/sparql?1
Alper P, Belhajjame K, Goble CA, Karagoz P: Enhancing and abstracting scientific workflow provenance for data publishing. Proc Jt EDBT/ICDT 2013 Work - EDBT ’13. 2013, New York, New York, USA: ACM Press, 313-
Research Object in myExperiment. http://www.myexperiment.org/packs/428
Research Object results. http://alpha.myexperiment.org/packs/405/resources/kegg_cp_comparison_results.xls
DCMI Usage Board (2012): DCMI Metadata Terms. http://dublincore.org/documents/2012/06/14/dcmi-terms/
RO checklist document in RDF. https://github.com/wf4ever/ro-catalogue/blob/master/minim/minim-workflow-demo.rdf
Spreadsheet-based RO checklist document. https://github.com/wf4ever/ro-catalogue/blob/master/minim/minim-workflow-demo.pdf
Enhancing reproducibility. Nat Methods. 2013, 10: 367-367. doi:10.1038/nmeth.2471
Ince DC, Hatton L, Graham-Cumming J: The case for open computer programs. Nature. 2012, 482: 485-488. 10.1038/nature10836.
Peng RD: Reproducible research in computational science. Science. 2011, 334: 1226-1227. 10.1126/science.1213847.
SPARQL Protocol and RDF Query Language. http://www.w3.org/TR/sparql11-overview/
Cheung K-H, Kashyap V, Luciano JS, Chen H, Wang Y, Stephens S, Ciccarese P, Wu E, Wong G, Ocana M, Kinoshita J, Ruttenberg A, Clark T: The SWAN biomedical discourse ontology. J Biomed Inform. 2008, 41: 739-751. 10.1016/j.jbi.2008.04.010.
Page K, Palma R, Holubowicz P, Klyne G, Soiland-Reyes S, Cruickshank D, Cabero RG, Cuesta EG, De Roure D, Zhao J: From workflows to Research Objects: an architecture for preserving the semantics of science. Proc 2nd Int Work Linked Sci. 2012
dLibra. http://dlab.psnc.pl/dlibra/
myExperiment release schedule. http://wiki.myexperiment.org/index.php/Developer:ReleaseSchedule
Genome Space. http://www.genomespace.org/
Tiwari A, Sekhar AKT: Workflow based framework for life science informatics. Comput Biol Chem. 2007, 31: 305-319. 10.1016/j.compbiolchem.2007.08.009.
Romano P: Automation of in-silico data analysis processes through workflow management systems. Brief Bioinform. 2008, 9: 57-68.
Example Minim checklist definition. https://github.com/wf4ever/ro-catalogue/blob/master/v0.1/Y2Demo-test/workflow-experiment-checklist.rdf
Collection of example Minim checklist definitions. https://github.com/wf4ever/ro-catalogue/tree/master/minim
Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone S-A, Soldatova LN, Stoeckert CJ, Turner JA, Zheng J: Modeling biomedical experimental processes with OBI. J Biomed Semantics. 2010, 1 (Suppl 1): S7-10.1186/2041-1480-1-S1-S7.
Rocca-Serra P, Brandizi M, Maguire E, Sklyar N, Taylor C, Begley K, Field D, Harris S, Hide W, Hofmann O, Neumann S, Sterk P, Tong W, Sansone S-A: ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics. 2010, 26: 2354-2356. 10.1093/bioinformatics/btq415.
Article Google Scholar
Sansone S-A, Rocca-Serra P, Brandizi M, Brazma A, Field D, Fostel J, Garrow AG, Gilbert J, Goodsaid F, Hardy N, Jones P, Lister A, Miller M, Morrison N, Rayner T, Sklyar N, Taylor C, Tong W, Warner G, Wiemann S: The first RSBI (ISA-TAB) workshop: “can a simple format work for complex studies?”. OMICS. 2008, 12: 143-149. 10.1089/omi.2008.0019.
Article Google Scholar
Maguire E, González-Beltrán A, Whetzel PL, Sansone S-A, Rocca-Serra P: OntoMaton: a bioportal powered ontology widget for Google Spreadsheets. Bioinformatics. 2013, 29: 525-527. 10.1093/bioinformatics/bts718.
Article Google Scholar
Soldatova LN, King RD: An ontology of scientific experiments. J R Soc Interface. 2006, 3: 795-803. 10.1098/rsif.2006.0134.
Article Google Scholar
Ison J, Kalas M, Jonassen I, Bolser D, Uludag M, McWilliam H, Malone J, Lopez R, Pettifer S, Rice P: EDAM: An ontology of bioinformatics operations, types of data and identifiers, topics, and formats. Bioinformatics. 2013, 29: 1325-1332. 10.1093/bioinformatics/btt113.
Article Google Scholar
Wilkinson MD, Vandervalk B, McCarthy L: The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern. API and Reference Implementation J Biomed Semantics. 2011, 2: 8-
Article Google Scholar
Patrinos GP, Cooper DN, van Mulligen E, Gkantouna V, Tzimas G, Tatum Z, Schultes E, Roos M, Mons B: Microattribution and nanopublication as means to incentivize the placement of human genome variation data into the public domain. Hum Mutat. 2012, 33: 1503-1512. 10.1002/humu.22144.
Article Google Scholar
Mons B, Van Haagen H, Chichester C, Hoen ’t P-B, Dunnen JT D, Van Ommen G, Mulligen EM V, Singh B, Hooft R, Roos M, Hammond J, Kiesel B, Giardine B, Velterop J, Groth P, Schultes E, Den Dunnen JT: The value of data. Nat Genet. 2011, 43: 281-283. 10.1038/ng0411-281.
Article Google Scholar
Nanopublication schema.http://nanopub.org/nschema,
Garcia Castro L, McLaughlin C, Garcia A: Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data. J Biomed Semantics. 2013, 4 (Suppl 1): S5-10.1186/2041-1480-4-S1-S5.
Article Google Scholar
data.elsevier.com.http://data.elsevier.com/documentation/index.html,
Page KR, Fields B, De Roure D, Crawford T, Downie JS: Capturing the workflows of music information retrieval for repeatability and reuse. J Intell Inf Syst. 2013, 41: 435-459. 10.1007/s10844-013-0260-9.
Article Google Scholar
Garrido J, Soiland-Reyes S, Enrique Ruiz J, Sanchez S: AstroTaverna: Tool for Scientific Workflows in Astronomy. Astrophys Source Code Libr. 2013,http://ascl.net/1307.007,
Google Scholar
Mina E, Thompson M, Zhao J, Hettne K, Schultes E, Roos M: Nanopublications for exposing experimental data in the life-sciences: a Huntington’s Disease case study. SWAT4LS, volume 1114 of CEUR Workshop Proceedings, CEUR-WS.org. 2013, Edinburgh
Google Scholar
Huntington’s Disease study Research Object.http://sandbox.wf4ever-project.org/rodl/ROs/data_interpretation-2/,
ResearchObject.org.http://www.researchobject.org/,
Research Object examples.http://www.researchobject.org/initiative/,

Download references

Acknowledgements

The research reported in this paper is supported by the EU Wf4Ever STREP project (270129) funded under EU FP7 (ICT-2009.4.1), the EP/G026238/1 EPSRC project myGrid: A Platform for e-Biology Renewal, the IMI-JU project Open PHACTS (grant agreement no. 115191), and grants received from the Netherlands Bioinformatics Centre (NBIC) under the BioAssist program.

We gratefully acknowledge Matt Gamble for his advice on the Minim model.

Author information

Authors and Affiliations

Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
Kristina M Hettne, Harish Dharuri, Eleni Mina, Mark Thompson, Reinout van Schouwen, Peter A C ‘t Hoen & Marco Roos
School of Computer Science, University of Manchester, Manchester, UK
Katherine Wolstencroft, Khalid Belhajjame, Stian Soiland-Reyes, Sean Bechhofer & Carole Goble
Department of Zoology, University of Oxford, Oxford, UK
Jun Zhao, Don Cruickshank, David de Roure & Graham Klyne
Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
Oscar Corcho
Instituto de Astrofísica de Andalucía, Granada, Spain
Lourdes Verdes-Montenegro & Julian Garrido
Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
Katherine Wolstencroft

Corresponding author

Correspondence to Marco Roos.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

KMH participated in the design and the coordination of the study, participated in the design and creation of the workflows and the Web services used in the text-mining workflow, created the Research Object, and drafted the manuscript. HKD participated in the design of the study, participated in the design and creation of the workflows, and helped to draft the manuscript. JZ, KB, SSR, OC, GK, SB participated in the design of the semantic models and helped to draft the manuscript. KW, CG, LVM, JG, DR, PBH participated in the design of the study and helped to draft the manuscript. CG also prepared and co-supervised the work on Minim. EM performed the connection to the nanopublication model and helped to draft the manuscript. MT designed and performed the SPARQL queries and helped to draft the manuscript. DC implemented the requirements for creating a Research Object in myExperiment and helped to draft the manuscript. RS designed and implemented the web services used by the text-mining workflow. MR conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Research Object results. KEGG and Concept Profile Analysis comparison results. (XLS 10 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hettne, K.M., Dharuri, H., Zhao, J. et al. Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomed Semant 5, 41 (2014). https://doi.org/10.1186/2041-1480-5-41

Received: 13 May 2013
Accepted: 29 July 2014
Published: 18 September 2014
DOI: https://doi.org/10.1186/2041-1480-5-41
