Describing sequence annotation instances
Our starting point for modeling sequence annotations was the BED format, a widely used table-based format for sequence annotations that is easy to use and efficient to store (see Figure 1). It typically consists of rows with a reference (e.g. a chromosome identifier), start and end position on that reference, and a value for the annotation. Most UCSC genome browser annotations can be downloaded as BED tracks. We started by deriving our RDF model from the BED format: (i) we identified the desired upper ontological framework for the domain of interest; (ii) we converted data in the BED track to RDF triples; (iii) we further transformed the resulting triples by adding class definitions and ontology mappings to the final model. We describe these steps below:
Upper ontological framework
We chose to use the BFO (version 1.1) as our top-level ontological framework. We augmented BFO with a minimal Reference Sequence Annotation (RSA) ontology to capture classes and predicates, and defined alignment strategies for RSA with OBO.
Data transformation to triples
As a preparatory step, we first created annotation instances that closely matched our original data format. We created a 'naive' model for sequence annotation that directly translates the information in the BED file, with the addition of the reference assembly name (Figure 2). Predicates linking the resource and its property values were derived from the BED format description. At this stage, we used rdfs:Literal to capture concepts without further ontological grounding (i.e., without rdf:type relations). This data-centric approach to semantic modeling is similar to the 'syntactic' conversion that is often used for integration of non-RDF resources, where table values are converted to literals, and table names and headers to classes and properties, without any further semantic modeling [29]. These naive models usually have limited semantic depth, so that finding common elements for integration with other data sources can be difficult. Therefore, the naive model is often linked to a more sophisticated, or personal, model. In our case, we used the naive model as a starting point in the modeling process, replacing it step by step with a more precise model (Figure 3). The contents of rdfs:Literals from the naive model were thus converted to owl:instances, and class definitions were added. Below, we discuss the derivation of the new model step by step, while explaining the placement of new RSA classes and predicates, the reuse of existing ontologies, and potential problems with OBO alignment. An RDF representation of the final model is shown as follows:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rsa: <http://rdf.biosemantics.org/ontologies/rsa#> .
@prefix hg19: <http://rdf.biosemantics.org/data/genomeassemblies/hg19#> .
# The ro: namespace is not given in the source; the legacy OBO Relation Ontology IRI is assumed here.
@prefix ro: <http://www.obofoundry.org/ro/ro.owl#> .
@base <http://rdf.biosemantics.org/examples/sequence_annotation#> .
# The empty prefix must be declared for names such as :transcript to resolve.
@prefix : <http://rdf.biosemantics.org/examples/sequence_annotation#> .

:transcript a rsa:SequenceAnnotation ;
    rsa:refseqID "NM_001005484" ;
    rsa:isAnnotatedAt :location .

:location a rsa:AnnotationLocation ;
    rsa:start "69090"^^xsd:int ;
    rsa:end "70008"^^xsd:int ;
    rsa:mapsTo hg19:chr1 ;
    rsa:hasOrientation rsa:forward .

hg19:chr1 a rsa:ReferenceSequence ;
    ro:integral_part_of hg19:assembly .

hg19:assembly a rsa:ReferenceAssembly .
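The 'naive' translation step that precedes this final model can be illustrated with a minimal Python sketch. It assumes a six-column, tab-separated BED row; the `bed:` predicate prefix, the `row1` identifier, and the score value `0` are hypothetical and chosen only for illustration, while the chromosome, coordinates, identifier, and strand match the listing above. Every cell becomes a plain literal, which is exactly the limited semantic depth the text describes.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Column names follow the first six fields of the BED format description.
BED_COLUMNS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand"]


def naive_triples(bed_line: str, assembly: str, row_id: str) -> List[Triple]:
    """Translate one BED row into literal-valued triples (the 'naive' model)."""
    values = bed_line.rstrip("\n").split("\t")
    subject = f":{row_id}"
    # Every cell becomes a quoted literal; no classes or typed resources yet.
    triples = [
        (subject, f"bed:{column}", f'"{value}"')
        for column, value in zip(BED_COLUMNS, values)
    ]
    # The reference assembly is not part of the BED row, so it is added explicitly.
    triples.append((subject, "bed:assembly", f'"{assembly}"'))
    return triples


# Hypothetical BED row built from the values in the listing above.
row = "chr1\t69090\t70008\tNM_001005484\t0\t+"
for s, p, o in naive_triples(row, "hg19", "row1"):
    print(s, p, o, ".")
```

In such a sketch the chromosome and the assembly are only strings, so nothing can be linked to them; the refined model replaces these literals with resources such as hg19:chr1 and hg19:assembly.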
Modeling locations on a reference sequence
We considered two approaches to harmonizing genomic location across different reference sequences. On the one hand, one may consider the location as an integral part of the annotation. That is, if the location is changed, the annotation becomes a different annotation. For example, variant annotations generally include the location of the annotation as part of the identifier. Thus, a change of location results in a change of the identifier of the annotation. On the other hand, a location can be considered an instance separate from the annotation. In this way, a single annotation can be associated with multiple locations and a single location can be associated with multiple annotations. In our example, the second approach is more appropriate because it provides a mechanism to link an annotation to locations on different reference sequences and sequence assemblies. Therefore, we created an instance of RSA:AnnotationLocation, :region (named :location in the RDF listing above), as the subject of positional properties. We defined the instances hg19:assembly and hg19:chr1, with hg19:chr1 as ro:integral_part_of hg19:assembly. We linked :region to hg19:chr1, which indirectly linked this annotation with the reference assembly.
In the example shown in Figure 3, we kept :start and :end as rdfs:Literals. It is also possible to convert the values of :start and :end to rdfs:Resource, and assign values to these resources. However, we argue that :start and :end should be treated as data type properties of a region. By doing so, we discourage linking of other RDF resources to :region boundaries, and the smallest linkable resource remains :region. Furthermore, in practice, using rdfs:Resource to describe the start and end of a region (simply two numbers) leads to an explosion of triples. Hence, our model expresses instance data in its simplest form. In contrast, FALDO defines :start and :end as instances of faldo:Positions. It uses more triples (12 instead of 2) to describe the two points. A benefit of FALDO's approach is that it gives more flexibility to describe fuzzy regions.
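The indirect link from an annotation to its reference assembly can be illustrated with a small Python sketch over plain tuples. The first three triples are taken from the RDF listing of the final model; the second location (`:location_hg18`, `hg18:chr1`, `hg18:assembly`) is a hypothetical addition, with coordinates deliberately omitted, that only shows how a separate location instance lets one annotation point at several assemblies.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

TRIPLES: List[Triple] = [
    # From the listing of the final model.
    (":transcript", "rsa:isAnnotatedAt", ":location"),
    (":location", "rsa:mapsTo", "hg19:chr1"),
    ("hg19:chr1", "ro:integral_part_of", "hg19:assembly"),
    # Hypothetical second location on another assembly (coordinates omitted).
    (":transcript", "rsa:isAnnotatedAt", ":location_hg18"),
    (":location_hg18", "rsa:mapsTo", "hg18:chr1"),
    ("hg18:chr1", "ro:integral_part_of", "hg18:assembly"),
]


def objects(subject: str, predicate: str) -> Set[str]:
    """Return all objects of triples matching the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def assemblies_for(annotation: str) -> Set[str]:
    """Follow annotation -> location -> reference sequence -> assembly."""
    found: Set[str] = set()
    for location in objects(annotation, "rsa:isAnnotatedAt"):
        for reference in objects(location, "rsa:mapsTo"):
            found |= objects(reference, "ro:integral_part_of")
    return found


print(sorted(assemblies_for(":transcript")))
# → ['hg18:assembly', 'hg19:assembly']
```

Adding the second location required no change to the :transcript resource itself, which is the practical benefit of treating the location as a separate instance.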
Modeling stranded-ness of sequence features
In contrast to RNA and protein, the stranded-ness of DNA sequences needs to be addressed when modeling DNA sequence annotations. Because the two DNA strands are the reverse complement of each other, information encoded in one orientation can be derived from the other strand. Consequently, sequence records in DNA databases contain only one of the two DNA strands (as the other strand can be inferred), but this does not necessarily mean that this is the strand an annotation pertains to. We have to take this into account when modeling the strand in an annotation.
When annotations are only linked to the reverse strand of a reference sequence, there are two conceptual annotation models. "Reverse strand annotations" can be understood either as annotations on a sequence that is the reverse complement of a reference sequence, or they can be understood as annotations on the reference sequence, but their interpretations are based on the reverse complement. In the first conceptualization, we need to link an annotation instance to a new sequence instance that is the reverse complement of a known reference sequence. In the second conceptualization, the reverse-ness is a quality of the annotation similar to length being a quality of a region. In practice, most sequence annotation systems specify coordinates using one strand as reference (the forward strand) and "strand" or "orientation" to indicate which strand an annotation pertains to. Thus, the stranded-ness in our example data refers to how annotations can be interpreted on single strand sequences. We modeled this in our example RDF with :hasOrientation :forward.
We further argue that orientations of annotated regions are not limited to :forward and :reverse. If an annotation represents a sequence feature of both strands, such as a CpG Island, we consider the orientation of an annotation as :bidirectional. If the reference sequence is a syntactic sequence representing single strand molecules (RNA, protein), or if the sequence feature does not rely specifically on the underlying sequence (as in the case of a specific binding or chromatin features), the annotation orientation is :none. As a result, the class for annotation orientation is defined as an enumeration of four disjoint instances.
RSA:Orientation subclass of
{ RSA:forward, RSA:reverse, RSA:bidirectional, RSA:none } .
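The second conceptualization, in which coordinates always refer to the forward strand and orientation is a quality of the annotation, can be sketched in Python as follows. The enumeration mirrors the four RSA orientation instances; the mapping from the BED strand column (where "." is treated as :none), the 0-based half-open coordinates, and the toy reference string are assumptions made for this illustration only.

```python
from enum import Enum


class Orientation(Enum):
    """The four disjoint orientation instances of the enumeration."""
    FORWARD = "rsa:forward"
    REVERSE = "rsa:reverse"
    BIDIRECTIONAL = "rsa:bidirectional"
    NONE = "rsa:none"


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Derive the other DNA strand from the stored one."""
    return sequence.translate(_COMPLEMENT)[::-1]


def orientation_from_bed_strand(strand: str) -> Orientation:
    """Map a BED strand value to an orientation ('.' -> NONE is an assumption)."""
    mapping = {"+": Orientation.FORWARD, "-": Orientation.REVERSE}
    return mapping.get(strand, Orientation.NONE)


def read_feature(reference: str, start: int, end: int,
                 orientation: Orientation) -> str:
    """Interpret a region whose coordinates are given on the forward strand."""
    region = reference[start:end]  # 0-based, half-open (assumed convention)
    if orientation is Orientation.REVERSE:
        return reverse_complement(region)
    return region


toy_reference = "ATGCAAGT"  # hypothetical sequence, not real genomic data
print(read_feature(toy_reference, 2, 6, Orientation.FORWARD))  # → GCAA
print(read_feature(toy_reference, 2, 6, Orientation.REVERSE))  # → TTGC
```

The coordinates are identical in both calls; only the interpretation changes, which is why orientation is modeled as a property of the location rather than as a second sequence instance.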
RSA classes and alignment with OBO
We have created instances using five classes from RSA. To enable better integration of our data to existing linked data, we considered how to align RSA classes with OBO classes.
RSA:SequenceAnnotation can be regarded as an SO:sequence_feature with an annotated location on a reference sequence. However, SO is currently not directly aligned with BFO, although this is an ongoing effort [25]. To further improve the consistency and interoperability of SO, new approaches to the BFO alignment were proposed. Terms in SO could be distinguished as either molecular sequences (BFO:independent_continuant, IC) or abstract sequences (BFO:generically_dependent_continuant, GDC) representing molecular sequences [27]. This distinction provides a foundation for the alignment between SO and BFO. Following the same alignment strategy, we chose to refer to SO:sequence_feature as a subclass of GDC. While it is not necessarily true that all terms under SO:sequence_feature can be GDCs, it is outside the scope of this paper to define which section of SO:sequence_feature falls into GDC. Because RSA:SequenceAnnotation is an information entity, we also considered using IAO:information_content_entity as its superclass. However, it is not clear to us whether a class can be a subclass of both SO:sequence_feature and IAO:information_content_entity, because the definition of SO:sequence_feature under GDC is still under discussion. We therefore defer the alignment between RSA:SequenceAnnotation and IAO:information_content_entity to the alignment between SO and IAO. Meanwhile, IAO provides a useful link between database row instances and annotation instances. For example, an instance of RSA:SequenceAnnotation can be the object of IAO:is_about.
To summarize, we defined RSA:SequenceAnnotation as a subclass of BFO:generically_dependent_continuant, and in particular,
RSA:SequenceAnnotation subclass of
SO:sequence_feature and
RSA:isAnnotatedAt some RSA:AnnotationLocation
RSA:AnnotationLocation is a constraint on a reference sequence in terms of location and orientation, with data properties such as a start point and an end point. We argue that it should be classified as a GDC in BFO, because it cannot exist outside the context of an annotation of a reference sequence. However, this prevents alignment with other relevant classes in OBO. For instance, OGI:Biological_interval provides the location properties and it defines relationships between two instances of intervals such as by OGI:isLocatedBefore. Nevertheless, OGI:Biological_Interval is defined as the "spatial continuous physical entity" and a subclass of BFO:object, and thus a subclass of IC. In the context of sequences, this defines an interval as a molecular sequence. Therefore, we only defined relationships between the orientation, the reference, and the annotation location in the scope of RSA.
RSA:AnnotationLocation subclass of
RSA:hasOrientation some RSA:Orientation and
RSA:mapsTo some RSA:ReferenceSequence
RSA:ReferenceSequence is about biological sequences, and modeling biological sequences in ontologies is not easy [26]. In RSA, we defined RSA:ReferenceSequence as a syntactic sequence. This is an information-bearing entity that contains a series of letters from a given alphabet (e.g., ATGC for DNA). It can represent sequential information captured by a biological molecule, but may represent a (possibly empty) set of molecules. It can be stored in computer systems or on a piece of paper; its physical existence is therefore an instance of IAO:information_content_entity. To correctly model reference sequences, it is crucial to distinguish between the sequence content and the file storing the sequence content, and we therefore did not define RSA:ReferenceSequence as a subclass of IAO:information_content_entity. For example, both transcript sequences and chromosome sequences can be used as reference sequences, so instances of RSA:ReferenceSequence can be ro:proper_part_of another instance. This part-of relationship is important for the data integration scenarios shown in the next section, and it works only if RSA:ReferenceSequence is defined by the sequence content, as the sequence content of a transcript can be part of the sequence content of a chromosome. However, if RSA:ReferenceSequence is defined as a subclass of IAO:information_content_entity, the part-of relationship cannot be modeled, because the file of a transcript sequence is not part of the file of the chromosome sequence.
In addition, we were confronted with the limitations of the reality constraint of BFO [30]. In the field of sequence annotations, biologists often work with abstract entities that only have an indirect relation to entities that exist in reality. For instance, the notion of a consensus sequence is widely used in practice. Consensus sequences are hypothetical sequences designed to capture information not from single molecules, but from sets of similar molecules. In the case of reference sequence modeling, we must accommodate consensus sequences. If we modeled RSA:ReferenceSequence as a subclass of GDC, the instance hg19:chr1 (chromosome 1 in human genome assembly version 19) would have to inhere in an instance of a corresponding molecular sequence. However, there is no molecular sequence that corresponds to the sequence content of hg19:chr1, because hg19:chr1 is the consensus of the sequence content of chromosome 1 of multiple people. The consensus sequence modeling problem applies not only to sequences in genome assemblies, but also to all sequences generated by Next Generation Sequencing technologies. Even in the context of personal genome sequencing, a sequence may not be derived from a single molecule from a single cell, but from a set of molecules from multiple cells. As discussed by Hoehndorf et al., proper definitions of biological sequences require the upper ontological framework to handle hypothetical sequences [26]. Thus, we argue that the question of how to define a consensus sequence within the framework of BFO and OBO needs to be addressed by the OBO community. SO provides the class SO:consensus_region for consensus sequences. However, this class is not aligned with BFO, and it is unclear whether it was designed in accordance with OBO principles.
Finally, RSA:ReferenceAssembly is an information entity encapsulating a set of RSA:ReferenceSequence instances that are often used together to represent the total sequence content of an organism, and each RSA:ReferenceSequence is RO:proper_part_of an RSA:ReferenceAssembly. The version number of the assembly (in some cases, the timestamp) is crucial for data integration. RSA:ReferenceAssembly cannot be aligned with BFO, because its parts are not aligned with BFO.
Semantic relations between annotations
With a complete ontological framework in place, we then investigated how sequence annotations using different reference sequences can be semantically linked. Semantic relationships between sequence annotations are determined by the relationship between their reference sequences. We categorized three types of reference sequence relationships that are crucial for data integration: 1) The two reference sequences represent the same biological entity; 2) One reference sequence is a syntactic part of the other reference sequence; 3) One reference sequence can be syntactically derived from the other reference sequence. Here, we show how each reference sequence relationship defines the relationship between annotations in Figure 4.
The 'same as' relationship is important for integrating annotations based on different reference assemblies (Figure 4A). For example, the gene annotation of OR4F5 based on hg19 is the same as the one based on hg18. We note that the properties of the underlying reference sequence may differ, and hence the two annotations may have different properties (the start and end points on chromosome 1), but they share the same identifier (OR4F5). The 'equivalent' relationship is important for integrating annotations with different sequence features (Figure 4B). For example, the variant annotation NM_004006.2:c.178C>T is equivalent to the variant annotation NC_000023.10:g.32867853G>A. Although these two variant annotations are defined by different positions and different nucleotide substitutions, they describe the same biological variation from two different viewpoints. The c. notation uses a transcript as the reference sequence and captures the effect of the variation on RNA, whereas the g. notation uses a chromosome as the reference sequence and captures the effect on the genome. The 'derived from' relationship is important for connecting annotations that arise in different biological processes (Figure 4C). For example, the variant annotation on the protein level, NP_003997.1:p.Gln60*, is derived from the variant annotation on the transcript level, NM_004006.2:c.178C>T.
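The arithmetic behind the 'equivalent' and 'derived from' examples can be checked with a short Python sketch. It parses only simple substitutions of the form shown in the text; the parser, the function names, and the restriction to c. and g. descriptions are assumptions for illustration and not a general implementation of the variant nomenclature. Two observations follow: coding position 178 falls in codon 60, consistent with p.Gln60*, and the C>T change in the transcript is the base-wise complement of the G>A change on the chromosome, which is consistent with the transcript being annotated on the reverse strand of the chromosome reference.

```python
import re
from typing import NamedTuple, Tuple


class Substitution(NamedTuple):
    reference: str   # accession of the reference sequence
    kind: str        # "c" (coding transcript) or "g" (genomic)
    position: int
    ref_base: str
    alt_base: str


_PATTERN = re.compile(r"^([^:]+):([cg])\.(\d+)([ACGT])>([ACGT])$")
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def parse_substitution(description: str) -> Substitution:
    """Parse a simple single-nucleotide substitution description."""
    match = _PATTERN.match(description)
    if match is None:
        raise ValueError(f"unsupported description: {description}")
    reference, kind, position, ref_base, alt_base = match.groups()
    return Substitution(reference, kind, int(position), ref_base, alt_base)


def codon_number(coding_position: int) -> int:
    """Codon index (1-based) containing a 1-based coding position."""
    return (coding_position - 1) // 3 + 1


def on_opposite_strand(sub: Substitution) -> Tuple[str, str]:
    """Express the base change as seen on the complementary strand."""
    return _COMPLEMENT[sub.ref_base], _COMPLEMENT[sub.alt_base]


c_variant = parse_substitution("NM_004006.2:c.178C>T")
g_variant = parse_substitution("NC_000023.10:g.32867853G>A")

print(codon_number(c_variant.position))                                       # → 60
print(on_opposite_strand(c_variant) == (g_variant.ref_base, g_variant.alt_base))  # → True
```

Position 178 is the first base of codon 60, and a C>T change at the first base of a glutamine codon (CAA or CAG) yields a stop codon (TAA or TAG), which matches the protein-level description.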
Interoperability across reference assemblies
Defining the relationship between reference sequences from different reference assemblies is not trivial. In line with semantic data integration strategies [29], our goal was to define the common domain of integration across reference assemblies at the chromosome level. However, this domain of integration is outside the scope of RSA, and modeling the relation between a consensus sequence and a chromosome in line with BFO was not straightforward. In this section, we present three possible methods to connect reference sequences across assemblies.
The first method uses the 'inheres in' property to relate the two instances of class Chromosome 1 that then represent the common domain (Figure 5A). This approach seems to follow BFO. However, we did not find an existing superclass for Chromosome 1, because hg19:chr1 and hg18:chr1 are consensus sequences that do not inhere in any particular chromosome. The superclass for Chromosome 1 would require an equivalent of a consensus chromosome that is a subclass of IC, which we have shown in the last section is not currently possible.
The second method is perhaps the least attractive, because it defines a relationship between an OWL:Individual and an OWL:Class that is not a class assertion, violating the OWL-DL definition and making reasoning over datasets undecidable. However, this method eliminates the need for 'real chromosome instances' required by the 'inheres in' relationship in the first method (Figure 5B).
The third method uses a single instance of an abstract class Chromosome 1 (Figure 5C), using the Genome Component Ontology (GCO). GCO does not follow BFO's realism viewpoint, and is intentionally kept as minimal as possible. It defines the abstract division of the total genetic information of an organism by its physical separation into different components, but does not describe any specific characteristics derived through experimentation. Instances of GCO:GenomeComponent provide high-level references. More specific descriptions, such as gene content, length, function, location, loci or sequence, can be linked to instances of GCO:GenomeComponent.
Each method has advantages and disadvantages. We consider method 3 the best option for data integration, because it offers good features for linking and integrating data without violating OWL-DL restrictions. A disadvantage is that it is not aligned with BFO, which may impede integration with data annotated using a BFO-based ontology. Therefore, we retained only the minimal set of classes in GCO. The RDF representation for the model shown in Figure 4C is accessible at http://rdf.biosemantics.org/examples/gco_integration.