Overview of the SEE approach
The SEE approach for representing evidence consists of providing (i) a formal representation of scientific claims, their provenance and the argumentative structure used to justify them by other claims, (ii) a formal representation of claim content and (iii) a coherent integration of the two. SEE relies on an abstract model for the representation of claims, provenance and argumentative structure specified in the Reasoning and Discourse Ontology (RDO), a lightweight OWL vocabulary developed for this purpose. Claim content, e.g., what is claimed regarding the properties of biological entities or the results and methods of an investigation, is represented in RDF graphs using appropriately defined semantic web resources and design patterns which, as a best practice, should be re-used from existing domain ontologies where possible. The connection between claims as representational primitives and their content relies on named RDF graphs [21], which make it possible to point to collections of RDF triples or OWL axioms serialized as such.
After outlining general requirements and design principles for representation of evidence we describe the RDO. We then demonstrate the application and design patterns of the SEE approach in a case study generating an expressive representation of evidence reported in the literature for the location of the enzyme glutamine synthetase.
Deriving design principles and requirements for representation of evidence
We posit two design principles for the representation of evidence and explain their rationale in the following:
DP1: Representation of evidence amounts to representation of claims and argumentative structure.
DP2: Evidence relations in the sense of "A is evidence for B" obtain between the things being claimed.
Accounts of evidence are directed towards the justification of scientific claims. The SEE approach is based on the notion that scientific claims put forward possible, more or less likely scenarios and outcomes - states of affairs [22] - as being accurate descriptions of a subject of scientific inquiry. Something is evidence for a certain state of affairs if and only if it gives reason to believe that this state of affairs in fact obtains [23]. A pairing of evidence and what it is claimed to be evidence for therefore corresponds to the set of premises and the conclusion of an argument in which the truth of the premises is alleged to give reason to believe that the conclusion is true. Therefore, the evidence used by authors or agents to justify a claim, possibly using further unstated background assumptions, can be mirrored by an argumentative structure having the claim as its conclusion. Typically, what is used to justify the authors' conclusions within this argumentative structure are claims that are themselves accepted as true on the basis of observations or inferences of the same or of other investigators. SEE, therefore, models evidence relations in the sense of "A is evidence for B" specifically as relations between claims.
We derive two additional requirements:
DP3: A researcher's assessment of the evidence for a finding usually includes evaluation of which materials and methods were used, what kind of data was obtained and which properties were observed, inferred or assumed to establish the finding. Consequently, a representation of the materials, methods, data items and other elements forming the subject of a claim should be part of a computationally accessible evidence representation. In RDF and OWL, the subject of a claim, a state of affairs, must be expressed, using appropriately defined resources, as one or more triples or axioms, respectively. It follows, in accordance with DP2, that in an RDF/OWL-based representation of evidence that includes claim subjects, evidential relationships should be represented as holding between claim subject representations, i.e., between sets of RDF triples and/or OWL axioms.
DP4: Representation of claims and hence representation of evidence must take into account claim provenance, in particular through which source and by which agents the claims were made. Knowing which agent made the claim is crucial for evaluating independence and reproducibility. Tracking the original source of a claim provides a natural reference point for all subsequent representations of the claim and its supporting background and for re-evaluation of the claim within the original context in which it was communicated.
We therefore identify as minimal components for modelling evidence elements representing (i) scientific claims and the argumentative structure used to justify them by other claims, (ii) the subjects of the claims, i.e., what is claimed with regard to a subject of inquiry, (iii) the agents making the claims and arguments, and (iv) the sources in which claims were originally made, e.g., the original scientific articles or database records.
Reasoning and Discourse Ontology (RDO)
Based on the foregoing we developed an abstract model for representation of evidence in terms of claims, their argumentative structure and their provenance. It is specified here as the Reasoning and Discourse Ontology (RDO) using the Web Ontology Language (OWL). This section outlines the core classes and properties of RDO. The full formal specification of all RDO constructs is given in the ontology file provided as additional file 1.
The typical scenario that underlies the constructs defined in RDO is the following: Agents (e.g., individual scientists) make claims on particular occasions (e.g., as authors of a published scientific article) about a subject of inquiry. The subject of the claim, i.e., what is claimed, is communicated in some linguistic form, often as part of a more comprehensive report (e.g., a scientific article) authored by the agents. Claims are usually justified by other claims the subject of which has been accepted as true, usually on the basis of yet other claims. RDO (Figure 1) rests on the distinction between a claim, its subject and the linguistic form in which this subject is communicated, and is centered around the concept of an assertion [24]: instances of the class assertion (courier typeface denotes OWL classes, courier in italics denotes OWL properties) represent particular claims made by particular agents on a particular occasion that a particular proposition, the subject of the claim, is true. Propositions, in our model, are represented by the class proposition and taken to represent the semantic content of contextualized lexical entities formulated in some natural or artificial language [25]. The lexical entities by which the subject of a claim, and propositions and reports in general, are formulated are represented using the class text. A further core class is report, which represents accounts intended to accurately describe an event or situation. Thus, scientific journal articles or database records as typical sources of assertions are examples of an rdo:report. The class agent is used to represent individual persons, corporate bodies or information processing devices as role players in the creation of reports or assertions. RDO specifies various properties to represent the relations between instances of these classes (Figure 1).
In particular, argumentative structure is captured by the property is_inferred_from, which relates an instance of assertion to another if and only if the former is, directly or indirectly, inferred from the latter (and possibly other premises).
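As an informal illustration of the model described above, the core RDO classes and the "directly or indirectly inferred" reading of the inference property can be mimicked with plain Python data classes. This is only a sketch and not the OWL specification in additional file 1: the Python attribute names, the instance labels and the omission of composite assertions are simplifying assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Agent:
    name: str


@dataclass(frozen=True)
class Report:
    title: str


@dataclass(frozen=True)
class Proposition:
    identifier: str  # hypothetical identifier of the claim subject


@dataclass(eq=False)  # identity-based equality and hashing
class Assertion:
    """A claim by particular agents, in a particular report, that a proposition is true."""
    label: str
    asserts: Proposition
    made_by: tuple
    made_in: Report
    directly_inferred_from: list = field(default_factory=list)


def is_inferred_from(conclusion, premise):
    """Return True if premise is reachable via one or more direct inference links."""
    pending = list(conclusion.directly_inferred_from)
    visited = set()
    while pending:
        current = pending.pop()
        if current is premise:
            return True
        if current in visited:
            continue
        visited.add(current)
        pending.extend(current.directly_inferred_from)
    return False


# Hypothetical instances loosely following the case study (composites omitted).
tate_et_al = (Agent("Tate"), Agent("Leu"), Agent("Meister"))
subject = Proposition("some GS-enzyme isolated from some rat liver")
a3 = Assertion("A3", Proposition("sample-1 has GS-activity"), tate_et_al, Report("Tate 1972"))
a2 = Assertion("A2", subject, tate_et_al, Report("Tate 1972"), [a3])
a1 = Assertion("A1", subject, (Agent("Meister"),), Report("Meister 1985"), [a2])

print(is_inferred_from(a1, a3))  # indirect inference via A2
```

Keeping the direct links as stored data and computing the indirect relation on demand mirrors the distinction between a direct inference property and its transitive generalization.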
Application: representation and evaluation of evidence for a source of glutamine synthetase
Introducing the case study
We applied SEE to generate a computationally accessible, expressive and extensible account of evidence gathered in the literature regarding a claimed source of the enzyme glutamine synthetase (GS). We have chosen this particular test case because obtaining reliable information on location of enzyme activities is a subject area of particular importance for systems biology approaches such as the reconstruction of cell-type specific [26] or organism-level [27] metabolic networks. Furthermore, it embodies the typical task of acquiring knowledge on a subject of inquiry by extracting and combining evidence from different sources.
The starting point is our evaluation of a scientific journal article [28] (referred to as 'Meister 1985' in the following) authored by Alton Meister which asserts in the second paragraph of the text, among other things, that the enzyme glutamine synthetase (GS) was isolated from rat liver. This assertion is based, by way of citation, on the contents of another article by Tate, Leu and Meister [29] (referred to as 'Tate 1972' in the following). In Tate 1972 the isolation of GS from rat liver is reported. The finding is reported to be based on an investigation which involved, among other things, extraction of rat livers, protein purification and γ-glutamyl hydroxamate synthesis (γ-GHS) assays. In the following we show how this context is formalized using the SEE approach to yield a detailed formal account of the evidence presented through these articles for rat liver as a source of GS. In doing so, we illustrate various design patterns used in SEE for representing the relevant items. For clarity, assertion instances will be indexed as A1, A2, and so forth.
Representing the evidence
Figure 2 shows how the assertion from Meister 1985 that GS was isolated from rat liver is represented using RDO, exemplifying the design pattern used to represent the relations between a particular assertion and its subject and provenance: The article itself, Meister 1985, is classified as an instance of report and annotated with a uniform resource locator (URL) pointing to its digital representation. The second paragraph of Meister 1985 constitutes a report_part. It is expressed by the English-language text in which it is written, which is represented as an instance of text. The original text is linked to it via the data property has_lexical_structure. Meister's claim, contained in this paragraph, that glutamine synthetase was isolated from rat liver is represented by an instance of assertion (A1) labelled as '! some GS-enzyme isolated from some rat liver ! AM' to indicate the assertion subject in a concise, human-readable manner (formalization of assertion subjects is described below). A1 is related to a corresponding instance of proposition identifying the subject of the claim, to an instance of agent representing Alton Meister, and to said report part by the properties asserts, is_assertion_made_by and is_assertion_made_in, respectively.
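The pattern for assertion A1 can be written out as a small set of subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library. All "ex:" identifiers are hypothetical, the two properties marked as assumed are not named in the text, and the paragraph text is left as a placeholder.

```python
# Placeholder only; the actual paragraph text is not reproduced here.
PARAGRAPH_TEXT = "<text of the second paragraph of Meister 1985>"

triples = [
    ("ex:Meister1985", "rdf:type", "rdo:report"),
    ("ex:Meister1985_par2", "rdf:type", "rdo:report_part"),
    ("ex:Meister1985_par2", "ex:is_part_of", "ex:Meister1985"),    # assumed property name
    ("ex:Meister1985_par2", "ex:is_expressed_as", "ex:text-1"),    # assumed property name
    ("ex:text-1", "rdf:type", "rdo:text"),
    ("ex:text-1", "rdo:has_lexical_structure", PARAGRAPH_TEXT),
    ("ex:AltonMeister", "rdf:type", "rdo:agent"),
    ("ex:proposition-1", "rdf:type", "rdo:proposition"),
    ("ex:A1", "rdf:type", "rdo:assertion"),
    ("ex:A1", "rdo:asserts", "ex:proposition-1"),
    ("ex:A1", "rdo:is_assertion_made_by", "ex:AltonMeister"),
    ("ex:A1", "rdo:is_assertion_made_in", "ex:Meister1985_par2"),
]


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def provenance(assertion):
    """Collect the subject, agents and sources linked to an assertion."""
    return {
        "subject": objects(assertion, "rdo:asserts"),
        "agents": objects(assertion, "rdo:is_assertion_made_by"),
        "sources": objects(assertion, "rdo:is_assertion_made_in"),
    }


print(provenance("ex:A1"))
```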
Claims which reiterate previous findings are represented as assertions on the same subject made by the respective agents. Formally, the reiterating claim is represented as an assertion instance which is linked to the source assertions by is_directly_inferred_from and linked to the same proposition instance as the source assertions by asserts. Each assertion can be linked to its corresponding agents and reports. Application of this design pattern to our case study is shown in Figure 3: The fact that Meister's assertion (A1) reiterates what Tate and co-workers have asserted on the isolation of GS from rat liver (A2) is represented by a relation of the former to the latter via is_directly_inferred_from and by sharing the same proposition instance via asserts.
The argumentative structure within and across the publications is represented as a series of assertion instances and is_directly_inferred_from relations with additional links to represent assertion subjects and provenance (Figure 4). The assertion instances linked to A2 reflect the results and the reasoning of the authors at various steps of their investigation, based on a careful analysis of the internal argumentative structure of Tate 1972. Specifically, Tate et al.'s main conclusion that GS-enzyme was isolated from rat liver (A2) is essentially based on asserting that (A3) there is a biological sample (labelled 'sample-1') which has GS-activity, that (A4) any GS-activity is borne by some GS-enzyme and that (A5) sample-1 was isolated from some rat liver (precise definitions of GS-enzyme and GS-activity in the context of the case study are detailed in additional file 2). The joint use of A3, A4 and A5 to infer A2 is made explicit by using the has_conjunctive_part property to link them to the same composite assertion instance, which in turn is related to A2 using the is_directly_inferred_from property. This pattern is used whenever an assertion is inferred from more than one premise. A3, the assertion that sample-1 has GS-activity, is justified in turn by asserting that (A6) it was input to a particular assay (labelled assay-1), that (A7) this assay produced a particular result, data item 1, and that (A8) this data item is a measurement of some GS-activity. A8, in turn, is justified by asserting that (A9) the data item is output of assay-1, that (A10) this assay was a γ-GHS assay, and that (A11 & A12) this type of assay is suited to measure GS-activity. Some assertions are not further justified, either because they reflect factual descriptions in Tate 1972 (A9, A10), represent general assumptions of the authors (A11) or are expressions of terminological domain knowledge (A12, A4). A5 exhibits a similar justification trail, as shown in Figure 4.
Full, formal representation of the argumentative structure for the test case is provided in additional file 2.
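The composite-assertion pattern described above can be sketched with two lookup tables, one per property. The composite identifiers C1 to C3 are hypothetical. The justification trail of A5 is left out because the text does not enumerate it, so A5 appears here as if it were not further justified.

```python
# Composite assertions (hypothetical identifiers) bundle jointly used premises.
has_conjunctive_part = {
    "C1": ["A3", "A4", "A5"],
    "C2": ["A6", "A7", "A8"],
    "C3": ["A9", "A10", "A11", "A12"],
}

is_directly_inferred_from = {
    "A1": ["A2"],
    "A2": ["C1"],
    "A3": ["C2"],
    "A8": ["C3"],
}


def direct_premises(assertion):
    """Return the premises an assertion is directly inferred from, unpacking composites."""
    premises = []
    for source in is_directly_inferred_from.get(assertion, []):
        premises.extend(has_conjunctive_part.get(source, [source]))
    return premises


def justification_trail(assertion):
    """Return every assertion used directly or indirectly to infer the given one."""
    trail = []
    pending = direct_premises(assertion)
    while pending:
        current = pending.pop(0)
        if current in trail:
            continue
        trail.append(current)
        pending.extend(direct_premises(current))
    return trail


def unjustified(assertion):
    """Return the assertions in the trail that have no recorded premises."""
    return [a for a in justification_trail(assertion) if not direct_premises(a)]


print(sorted(justification_trail("A2")))
```

Treating the composite as a node of its own, rather than linking A2 to each premise separately, keeps the information that the premises are used jointly in one inference step.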
The prevalent pattern in SEE for recording individual and logically relevant steps of an investigation is for any such step to link its outcomes (data or material), the techniques used to produce these outcomes, and their objectives as exemplified in the composite assertions comprising assertions A9-A12 and A15-A20 (Figure 4). In A9-A12, for example, the experimental process type (γ-GHS assay) is linked to the objective of its application (GS-activity measurement) and in turn to the quality that is intended to be determined (GS-activity). Generally, the relations between these ontologically different entities are not trivial and not one-to-one (one objective can consist of the determination of several qualities recognized in a scientific domain, a certain quality can be the subject of inquiry in several objectives). However, in this particular case the objective and quality are narrowly defined and directly correlated.
Representation of assertion subjects
The representation of argumentative structure and claim provenance as an interrelated set of assertion instances described so far is complemented by a structured representation of what is asserted in each assertion, the assertion subject. To this end, each assertion instance is linked to a corresponding proposition instance whose IRI (Internationalized Resource Identifier) identifies a named RDF graph. This graph provides a structured representation of the assertion subject using appropriately defined resources (Figure 5). This setup makes it possible to query the elements forming the assertion subject. In assertion A10, shown in Figure 5 as an example, it is asserted by Tate and co-workers that the particular assay they performed was a γ-GHS assay. Representing this statement as a graph identified by the IRI of the proposition instance linked to the assertion instance representing A10 gives access to the entities A10 is about: the particular assay, its asserted type, and the typing relation itself. The full specification of all propositions as named graphs in the context of the case study is provided in additional file 3.
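The link between an assertion and the content of its subject can be pictured as a two-step lookup: from the assertion to the IRI of its proposition, and from that IRI to the triples of the named graph. The following sketch uses a dictionary in place of an RDF dataset. The identifiers "ex:A10", "ex:proposition-10", "ex:assay-1" and "ex:gamma_GHS_assay" are hypothetical stand-ins for the resources in additional file 3.

```python
# Assertion -> proposition IRI (the IRI doubles as the name of a graph).
asserts = {"ex:A10": "ex:proposition-10"}

# Named graphs: graph IRI -> set of (subject, predicate, object) triples.
named_graphs = {
    "ex:proposition-10": {
        ("ex:assay-1", "rdf:type", "ex:gamma_GHS_assay"),
    },
}


def subject_graph(assertion):
    """Return the triples stating what the assertion asserts."""
    return named_graphs[asserts[assertion]]


def subject_terms(assertion):
    """Return every resource mentioned in the assertion subject, sorted."""
    return sorted({term for triple in subject_graph(assertion) for term in triple})


print(subject_terms("ex:A10"))
```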
To generate the graph representations of the assertion subjects, the natural language expressions of the assertions identified in the Meister 1985 and Tate 1972 reports were formalized in RDF using appropriately defined resources (see additional files 2, 3 and 4). Most assertion subjects could be formalized in a straightforward manner applying OWL 2 RDF-based semantics [30]. The principal claim that "glutamine synthetase was isolated from rat liver", which is the common subject of assertions A1 and A2, was formalized in RDF by instantiating the class gs_enzyme and is_isolated_from some rat_liver (shown as :proposition-1 in additional file 3). This exemplifies instantiation of the OWL class (A and related_to some B) as a design pattern for formalization of statements which can, in natural language, be represented in the form "some A related to some B" (A and B denoting OWL classes used to represent the types A and B, respectively, and related_to denoting an OWL property used to represent the relation among some of their instances).
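To make the "some A related to some B" pattern more concrete, the sketch below generates the RDF triples that encode an individual typed with the class expression (A and related_to some B), following the standard OWL 2 mapping of intersections and existential restrictions to RDF. Whether the case study uses an anonymous individual, as assumed here, is not stated in the text. The "ex:" prefix and the blank node labels are also assumptions.

```python
import itertools

_ids = itertools.count(1)


def blank_node():
    """Return a fresh blank node label."""
    return f"_:b{next(_ids)}"


def some_a_related_to_some_b(class_a, related_to, class_b):
    """Return (individual, triples) instantiating the class (A and related_to some B)."""
    individual = blank_node()
    restriction = blank_node()
    expression = blank_node()
    first_cell = blank_node()
    second_cell = blank_node()
    triples = [
        # related_to some B
        (restriction, "rdf:type", "owl:Restriction"),
        (restriction, "owl:onProperty", related_to),
        (restriction, "owl:someValuesFrom", class_b),
        # A and (related_to some B), with the operands held in an RDF list
        (expression, "rdf:type", "owl:Class"),
        (expression, "owl:intersectionOf", first_cell),
        (first_cell, "rdf:first", class_a),
        (first_cell, "rdf:rest", second_cell),
        (second_cell, "rdf:first", restriction),
        (second_cell, "rdf:rest", "rdf:nil"),
        # the asserted instance of the class expression
        (individual, "rdf:type", expression),
    ]
    return individual, triples


individual, graph = some_a_related_to_some_b(
    "ex:gs_enzyme", "ex:is_isolated_from", "ex:rat_liver"
)
for triple in graph:
    print(triple)
```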
Labels of assertion and corresponding proposition instances are directly derived from the graph representation of the assertion subject (see methods section). In particular, the label "some A related_to some B" is used for proposition instances that represent statements of the form "some A related to some B" by applying the design pattern described above.
Representing consecutive layers of interpretation and own conclusions
We use the test case to specify additional design patterns to represent the activity of a curator or, more generally, of a third party evaluating a scientific report. Our representation of the evidence in the Meister 1985 and Tate 1972 reports is the result of the interpretation by another agent (Christian Bölling - CB). This can be explicitly represented in SEE using its familiar design pattern for propositions and assertions. For example, the claim that Tate et al. indeed assert that the assay they performed was a γ-GHS assay in their 1972 publication can be represented as an assertion instance in its own right, made by another agent, CB (Figure 6). This pattern allows for representing arbitrarily many consecutive layers of interpretation or attribution.
So far the presented account consists of assertions attributed to the authors of the Meister 1985 and Tate 1972 reports, i.e., a representation of what these authors assert. SEE also provides the resources to append one's own conclusions. For example, an agent, CB, could upon evaluation of the claims made by Tate et al. conclude for himself that GS was indeed isolated from rat liver. This is represented as an assertion instance in its own right (A30, labelled '! some GS-enzyme isolated from some rat liver ! CB'). It is linked to the corresponding proposition via asserts and to the assertions made by Tate et al. via is_directly_inferred_from. We describe two semantically different patterns to make this connection. In pattern 1, assertion A30 is linked to assertion A2 (Figure 7). In pattern 2 (Figure 8), A30 is linked to a new composite assertion that involves two more curator assertions (A31, A32) and A4 as a representation of terminological domain knowledge. A31 and A32 are linked by is_directly_inferred_from to composite assertions reflecting factual descriptions of data and procedures given in Tate 1972. There is a subtle, yet important difference in meaning between these two representations. In pattern 1, CB's conclusion is based on Tate et al.'s assertion on the same subject, i.e., it is based on the author statement itself and does not necessarily imply an affirmation of how Tate et al. reached their conclusion. In pattern 2, the curator inference is based on factual descriptions in Tate 1972, i.e., it affirms the conclusions of Tate et al. as the curator's own conclusions on the basis of the reported experimental results.
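The difference between the two patterns shows up when asking whether A30 depends on the authors' conclusion A2. The sketch below encodes both patterns as small link tables and checks reachability. The composite identifiers are hypothetical, and the parts of the composites describing Tate 1972 are not listed because the text does not enumerate them.

```python
def depends_on(inferred_from, conjunctive_parts, conclusion, target):
    """Return True if target is reachable from conclusion via inference or part links."""
    pending = [conclusion]
    visited = set()
    while pending:
        current = pending.pop()
        linked = inferred_from.get(current, []) + conjunctive_parts.get(current, [])
        for source in linked:
            if source == target:
                return True
            if source not in visited:
                visited.add(source)
                pending.append(source)
    return False


# Pattern 1: the curator conclusion rests on the authors' own conclusion.
pattern_1_inferred_from = {"A30": ["A2"]}
pattern_1_parts = {}

# Pattern 2: the curator conclusion rests on curator assertions and A4.
pattern_2_inferred_from = {
    "A30": ["C_curator"],          # hypothetical composite identifier
    "A31": ["C_tate_facts_1"],     # hypothetical; parts not enumerated in the text
    "A32": ["C_tate_facts_2"],     # hypothetical; parts not enumerated in the text
}
pattern_2_parts = {"C_curator": ["A31", "A32", "A4"]}

print(depends_on(pattern_1_inferred_from, pattern_1_parts, "A30", "A2"))
print(depends_on(pattern_2_inferred_from, pattern_2_parts, "A30", "A2"))
```

Under these assumptions, only pattern 1 makes A30 traceable to A2, which matches the reading that pattern 2 bypasses the authors' conclusion and builds on their reported facts instead.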
Evaluation of a given set of data might also lead to conclusions different from those of the authors. Such alternative interpretations can be represented using SEE. For example, one might dispute that γ-GHS assays are suited to measure GS-activity (EC 6.3.1.2). The γ-GHS assay works by measuring the formation of L-γ-glutamyl hydroxamate rather than glutamine [31]. Tate et al. assert GS-activity measurement as the objective of its application, accepting the formation of the hydroxamate under the conditions of the assay as a proxy for the formation of glutamine and the actual reaction mechanism. Assertion A11, using the property achieves_objective, reflects this acceptance by Tate et al. Alternatively, a third party could assert that γ-GHS assays merely achieve the less specific objective of measuring γ-glutamyl transferase (GGT) activity (EC 2.3.2.2) (Figure 9, assertion A45). In this case the data reported by Tate et al. can still be used to infer that rat liver is a source of GGT-enzyme (Figure 9, assertion A40).
Evaluating the test case evidence representation
The test case evidence representation that was created using the RDO constructs and design patterns was evaluated in terms of its potential to answer, within the confines of the case study, a list of competency questions reflecting different aspects of the evidence a researcher investigating glutamine synthetase knowledge would be interested in:
Q1: Which locations of GS have been asserted?
Rat liver.
Q2: Where has rat liver GS been reported?
The Meister 1985 and Tate 1972 reports.
Q3: Do the assertions made in these reports pertain to independent observations?
No. Meister's assertion is based on Tate et al.'s assertion. Moreover, some of the authors of the two reports are identical.
Q4: Is there experimental evidence and where is it described?
Yes. In the Tate 1972 report.
Q5: Which observations and techniques were used for establishing rat liver as GS source?
1. extraction of a protein sample from rat liver (technique: TLM purification)
2. that sample has GS-activity (technique: γ-GHS assay)
Q6: Did Tate et al. really make these observations and conclusions? Who created this account of their findings?
Christian Bölling.
Based on the SEE design patterns, these questions could be formulated as SPARQL [32] queries and successfully answered (see additional file 5). In each of Q1-Q6 the structured representation of assertion subjects as named graphs, alongside the other SEE design patterns, is used to identify assertions which are relevant to answering the query. To answer Q1, assertions are identified whose subject's graph representation includes a graph pattern indicative of the isolation of GS from some location (Figure 10A). To answer Q3, pairs of assertions are identified whose subjects share the same graph representation and where one is inferred from the other (Figure 10B).
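The logic behind Q1 and Q3 can be approximated without a SPARQL engine by matching patterns over dictionaries of triples. The sketch below is not the set of queries in additional file 5. It flattens the assertion subject into simple instance-level triples instead of OWL class expressions, and all identifiers are hypothetical.

```python
# Named graphs keyed by proposition identifier (simplified, hypothetical content).
named_graphs = {
    "P1": {
        ("_:enzyme", "rdf:type", "ex:gs_enzyme"),
        ("_:enzyme", "ex:is_isolated_from", "_:source"),
        ("_:source", "rdf:type", "ex:rat_liver"),
    },
    "P10": {("ex:assay-1", "rdf:type", "ex:gamma_GHS_assay")},
}
asserts = {"A1": "P1", "A2": "P1", "A10": "P10"}
is_directly_inferred_from = {"A1": ["A2"]}
made_by = {
    "A1": {"Meister"},
    "A2": {"Tate", "Leu", "Meister"},
    "A10": {"Tate", "Leu", "Meister"},
}


def asserted_locations(enzyme_class):
    """Q1: map each asserted source class to the assertions stating it."""
    found = {}
    for assertion, proposition in asserts.items():
        graph = named_graphs[proposition]
        types = {s: o for s, p, o in graph if p == "rdf:type"}
        for s, p, o in graph:
            if p == "ex:is_isolated_from" and types.get(s) == enzyme_class and o in types:
                found.setdefault(types[o], set()).add(assertion)
    return found


def dependent_pairs():
    """Q3: pairs (a, b) on the same subject where a is directly inferred from b."""
    return [
        (a, b)
        for a, sources in is_directly_inferred_from.items()
        for b in sources
        if asserts[a] == asserts[b]
    ]


print(asserted_locations("ex:gs_enzyme"))
for a, b in dependent_pairs():
    print(a, "depends on", b, "shared agents:", sorted(made_by[a] & made_by[b]))
```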
The following evidence-related information can be queried exploiting property chains and other axioms defined for the RDO constructs:
- all assertions which are directly or indirectly used to infer a given assertion
- all assertions made in a given report
- all assertions made by a given agent
- all assertions on the same subject
- all agents making assertions on a given subject
For the corresponding queries see additional file 5. As an example, Figure 11 shows the object property assertions inferred for assertion A1, Meister's assertion that GS was isolated from rat liver. These inferences, derived in Protégé 4 with HermiT 1.3.8 as a reasoner, include all assertions which A1 is directly or indirectly inferred from and all reports and texts A1 is based on.
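The kinds of lookup listed above, and the idea of chaining an inference link with a provenance link, can be imitated with a few set operations. This sketch does not reproduce the RDO axioms or the reasoner output. The particular chain used in reports_based_on (inference followed by "made in") is an assumption for illustration, composite assertions are omitted, and the identifiers are hypothetical.

```python
is_directly_inferred_from = {"A1": ["A2"], "A2": ["A3"]}  # composites omitted
asserts = {"A1": "P1", "A2": "P1", "A3": "P3"}
made_in = {"A1": "Meister 1985", "A2": "Tate 1972", "A3": "Tate 1972"}
made_by = {
    "A1": {"Meister"},
    "A2": {"Tate", "Leu", "Meister"},
    "A3": {"Tate", "Leu", "Meister"},
}


def inferred_from(assertion):
    """All assertions directly or indirectly used to infer the given assertion."""
    result = set()
    pending = list(is_directly_inferred_from.get(assertion, []))
    while pending:
        current = pending.pop()
        if current not in result:
            result.add(current)
            pending.extend(is_directly_inferred_from.get(current, []))
    return result


def assertions_in(report):
    return {a for a, r in made_in.items() if r == report}


def assertions_by(agent):
    return {a for a, agents in made_by.items() if agent in agents}


def assertions_on(proposition):
    return {a for a, p in asserts.items() if p == proposition}


def agents_on(proposition):
    return set().union(*(made_by[a] for a in assertions_on(proposition)))


def reports_based_on(assertion):
    """Chain: follow inference links, then look up where each premise was made."""
    return {made_in[a] for a in inferred_from(assertion)}


print(sorted(inferred_from("A1")))
print(sorted(reports_based_on("A1")))
```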