RDF2Graph performs two distinct processes to retrieve the structure of a resource. In the first stage, all classes, predicates and unique type links are recovered together with their associated statistics. In the second stage, a simplification step is applied to arrive at a neat structural overview. A simplified overview of the complete process is given in Fig. 1.
A type link is defined as a link joining a subject class type to an object class or value data type, via a predicate. A unique type link is defined as a unique tuple: type of subject, predicate, (data)type of object. For the triple <:BRCA1, :locatedOn, :chromosome17> the type link is <:Gene, :locatedOn, :Chromosome>. When considering the full resource, all type links <:Gene, :locatedOn, :Chromosome> correspond to the same unique type link. Similarly, for the triple <:Adam, :hasSon, :Bob> the type link is <:Person, :hasSon, :Person>.
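The mapping from triples to type links and unique type links can be illustrated with a minimal Python sketch. The instance-to-class table and the triples below are toy data built from the examples in the text (plus one additional gene), and the function names are hypothetical; RDF2Graph itself obtains this information through SPARQL queries rather than from in-memory dictionaries.

```python
from collections import Counter

# Toy data (assumption): instance -> class, based on the examples in the text.
TYPES = {
    ":BRCA1": ":Gene",
    ":TP53": ":Gene",
    ":chromosome17": ":Chromosome",
    ":Adam": ":Person",
    ":Bob": ":Person",
}

TRIPLES = [
    (":BRCA1", ":locatedOn", ":chromosome17"),
    (":TP53", ":locatedOn", ":chromosome17"),
    (":Adam", ":hasSon", ":Bob"),
]


def type_link(triple, types):
    """Map a triple to its type link: (subject type, predicate, object type)."""
    subject, predicate, obj = triple
    return (types[subject], predicate, types[obj])


def unique_type_links(triples, types):
    """Count how many type links fall under each unique type link."""
    return Counter(type_link(t, types) for t in triples)


links = unique_type_links(TRIPLES, TYPES)
for link, count in links.items():
    print(link, count)
```

Both :locatedOn triples collapse into the single unique type link <:Gene, :locatedOn, :Chromosome>, whose count is the number of type links associated with it.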
The multiplicity of a unique type link describes how many instances are connected to each other. The forward multiplicity can be: i) one-to-one (also denoted 1..1), each source instance has exactly one reference to the target; ii) one-or-many (1..N), each source instance has one or more references to the target; iii) zero-or-one (0..1), each source instance has at most one reference to the target, and some have none; iv) zero-or-many (0..N), source instances may have no, one, or more than one reference to the target. For the reverse multiplicity, the roles of target and source are inverted. In the previous examples, the forward multiplicity of the unique type link <:Gene, :locatedOn, :Chromosome> is (1..1), since each human gene is associated with one and only one chromosome, whereas the reverse multiplicity is (1..N), since a chromosome contains many genes. In the second case, <:Person, :hasSon, :Person>, the forward multiplicity is (0..N), since there is no limitation on the number of sons a person may have; in this case the reverse multiplicity is (N = 2..1) given that each son has two parents.
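As an illustration of these definitions, the following sketch classifies the forward and reverse multiplicity of one unique type link from the instance pairs that belong to it. The function names and the toy data are assumptions made for this example; the sketch only distinguishes the four categories listed above and assumes that the list of instances is not empty.

```python
from collections import Counter


def multiplicity(instances, links):
    """Classify the multiplicity seen from `instances`.

    `instances` lists every instance of the source type, including those
    without any reference; `links` holds (source, target) instance pairs.
    """
    counts = Counter(source for source, _ in set(links))
    per_instance = [counts.get(i, 0) for i in instances]
    lower = "0" if min(per_instance) == 0 else "1"
    upper = "N" if max(per_instance) > 1 else "1"
    return f"{lower}..{upper}"


def reverse_multiplicity(targets, links):
    """Same classification with the roles of source and target inverted."""
    return multiplicity(targets, [(t, s) for s, t in links])


# Toy data (assumption) for <:Gene, :locatedOn, :Chromosome>.
genes = [":BRCA1", ":TP53"]
chromosomes = [":chromosome17"]
located_on = [(":BRCA1", ":chromosome17"), (":TP53", ":chromosome17")]

forward = multiplicity(genes, located_on)
reverse = reverse_multiplicity(chromosomes, located_on)
print(forward, reverse)
```

Because the lower bound depends on instances that have no reference at all, the full list of source instances is needed in addition to the links themselves.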
The initial recovery process is performed through a series of SPARQL queries on the selected endpoint. Detailed information about the SPARQL queries, and the queries themselves, is provided in RDF2Graph’s documentation. These queries can be adapted to change the introduced limitations and to customise the tool for specific endpoints. The queries can be limited to reduce the running time, since this process can take from a few minutes for a resource with a million triples to several days for a resource with 16 billion triples, such as the RDF version of UniProt, as described in the ‘Results’ section. However, limiting the number of retrieved triples may leave the recovered structure incomplete, since some type links could be missed. If not all type links of a unique type link are retrieved, the calculated multiplicities (forward and/or reverse) can be wrong. If none of the type links associated with a unique type link are found, that unique type link is not identified at all. Therefore, we advise caution when using these limitations.
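To make the effect of such a limitation concrete, the sketch below builds a SPARQL query that counts type links over an optionally limited sample of triples. This is not one of the queries shipped with RDF2Graph; the query text, the function name and the variable names are assumptions for illustration only, and the query covers only objects that carry an rdf:type (literals and untyped objects are ignored).

```python
def build_type_link_query(limit=None):
    """Return a SPARQL query counting type links, optionally on a sample.

    Hypothetical helper: when `limit` is given, only that many triples are
    inspected, so some type links (and thus unique type links) can be missed.
    """
    limit_clause = f"LIMIT {int(limit)}" if limit is not None else ""
    return f"""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subjectType ?predicate ?objectType (COUNT(*) AS ?count)
WHERE {{
  {{ SELECT ?s ?predicate ?o WHERE {{ ?s ?predicate ?o . }} {limit_clause} }}
  ?s rdf:type ?subjectType .
  ?o rdf:type ?objectType .
}}
GROUP BY ?subjectType ?predicate ?objectType
""".strip()


full_query = build_type_link_query()
sampled_query = build_type_link_query(limit=1000000)
print(sampled_query)
```

With the limit in place, the counts and any multiplicities derived from them describe only the sampled triples, which is why the resulting structure may be incomplete.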
After the initial recovery of type links and unique type links, a simplification process follows, in which type links with a common parent class for either the subject or object types are merged. This process proceeds in a pairwise manner, so that at each iteration only two unique type links sharing either the subject type or the object (data)type are considered. If more than two unique type links are present, the first two are merged, and their result is combined with the next one, and so on until all have been considered. Figure 2 represents the cases that need to be considered when analysing two unique type links. In principle, other cases involving the “sameAs” relationship could appear, but in our approach the “subClassOf” relationship also includes the “sameAs” relationship, which reduces all possible cases to the ones represented in Fig. 2.
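The pairwise scheme can be sketched as a fold over the types involved: the first two are merged into their closest common parent class, and the result is merged with the next one. The class tree, the single-inheritance restriction and the function names below are simplifying assumptions for this illustration and do not reproduce the case analysis of Fig. 2.

```python
from functools import reduce

# Hypothetical class tree, child -> parent (single inheritance for brevity).
PARENT = {
    ":Gene": ":Feature",
    ":Exon": ":Feature",
    ":Intron": ":Feature",
    ":Feature": ":Thing",
}


def ancestors(cls, parent):
    """Return cls followed by its chain of parent classes."""
    chain = [cls]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def merge_pair(a, b, parent=PARENT):
    """Merge two types into their closest common parent class."""
    b_chain = set(ancestors(b, parent))
    for cls in ancestors(a, parent):
        if cls in b_chain:
            return cls
    raise ValueError(f"{a} and {b} have no common parent class")


def merge_all(types, parent=PARENT):
    """Merge the first two types, then combine the result with the next."""
    return reduce(lambda merged, t: merge_pair(merged, t, parent), types)


merged_type = merge_all([":Gene", ":Exon", ":Intron"])
print(merged_type)
```

Only two types are ever compared at a time, so the same pairwise routine handles any number of unique type links.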
This process also allows the identification of concept classes. A concept class is defined as a class that has no instances and no subclasses with instances. Typical examples of concept classes are the GO classes in the GO database [22]. The notion is needed so that concept classes can be excluded from the network view, since they add little to the structural overview and would overcrowd the network visualization.
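The definition of a concept class translates into a small check over the class tree, sketched below. The class names, the instance counts and the function names are hypothetical toy values; the sketch assumes that the subclass relations and the number of instances per class are already available.

```python
def has_instances_below(cls, children, instance_count):
    """True if cls or any direct or indirect subclass has an instance."""
    stack, seen = [cls], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if instance_count.get(current, 0) > 0:
            return True
        stack.extend(children.get(current, []))
    return False


def concept_classes(classes, children, instance_count):
    """Return the classes with no instances and no instantiated subclasses."""
    return {
        c for c in classes
        if not has_instances_below(c, children, instance_count)
    }


# Toy data (assumption): parent -> direct subclasses, class -> instance count.
children = {":Feature": [":Gene"], ":BiologicalProcess": [":CellCycle"]}
instance_count = {":Gene": 2}
classes = [":Feature", ":Gene", ":BiologicalProcess", ":CellCycle"]

concepts = concept_classes(classes, children, instance_count)
print(sorted(concepts))
```

Here :Feature is not a concept class, although it has no instances of its own, because its subclass :Gene is instantiated.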
All classes identified in the recovery process and the associated subClassOf links are loaded into an in-memory directed graph. This class tree is then used in the merging process. During the merging process, five steps are executed per retrieved predicate. Step 1 is the initialization; step 2 performs the merging in case A; steps 3 and 4 are used for case B, whereas case C is the combination of cases A and B from Fig. 2; step 5 is the finalization step.
The following definitions of shared types and of the child-of relation between classes are used. Two types are shared if i) both are the same, ii) one is a parent class of the other, or iii) both have a common parent class in the class tree. The child-of relation is defined as follows: class K is a child of class L if either class K is equal to class L or class K is a direct or indirect subclass of class L.
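Both definitions can be expressed compactly over the set of a class together with all of its direct and indirect parent classes. The sketch below is a minimal illustration with hypothetical function names and a toy class tree; it relies on the observation that the three conditions for shared types all amount to the two sets having at least one class in common.

```python
def ancestors_or_self(cls, parents):
    """Return cls together with all its direct and indirect parent classes."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def is_child_of(k, l, parents):
    """Class K is a child of class L if K equals L or K is a subclass of L."""
    return l in ancestors_or_self(k, parents)


def are_shared(a, b, parents):
    """Two types are shared if they are equal, one is a parent of the other,
    or both have a common parent class."""
    return bool(ancestors_or_self(a, parents) & ancestors_or_self(b, parents))


# Toy class tree (assumption): class -> direct parent classes.
parents = {":Gene": [":Feature"], ":Exon": [":Feature"]}

print(is_child_of(":Gene", ":Feature", parents))
print(are_shared(":Gene", ":Exon", parents))
print(are_shared(":Gene", ":Person", parents))
```

Note that under this reading a single root class common to the whole tree would make every pair of types shared, so the result depends on how the class tree is rooted.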
- Step 1: For each unique type link found for the predicate currently being processed, a temporary link is added to the type of subject, pointing to the (data)type of object. In this way a temporary link between both types is defined.
- Step 2: For each class in the class tree, all temporary links are copied to the respective parent class(es). Then, occurrences of case A from Fig. 2 are simplified by searching for pairs of temporary links that both point to a shared type. If such a pair is found, the two temporary links are merged and replaced by a new temporary link pointing to the common parent class.
- Step 3: This step is executed as a recursive, breadth-first, per-class process over the class tree. For each temporary link of the currently processed class, the number of direct child classes is counted that have at least one link pointing to a type that is a child of the type pointed to by that temporary link. When this count is one, the temporary link is removed from the currently processed class.
- Step 4: This step is executed as a depth-first, per-class process over the class tree. Each temporary link pointing to a type that is a child of the type pointed to by any of the links in the parent classes of the currently processed class is removed.
- Step 5: The remaining temporary links and the newly calculated unique type links are stored. The temporary links are then cleared from the class tree so that the system can process the next predicate.
Results are stored in a local triple store that contains the unique type links and their count (the number of type links associated with them), together with their forward and reverse multiplicities.
To store information for the new concept of unique type links, we developed a new ontology. Figure 3 depicts the elements within this ontology that are related to the storage of the unique type links. Each unique type link links to an object type, which is either: i) a class; ii) a data type, such as xsd:integer; iii) external, a subject in another resource; or iv) invalid, a subject with no defined type. In each class, the class property groups the associated unique type links per predicate and links them to the rdfs:Property. Additionally, the number of occurrences is stored for each class and predicate.