Justification for creating a new Semantic Web service standard
A decade ago, Stein expressed concern that, because a wide array of different approaches to Web service provision were emerging, "a chaotic world of incompatible bioinformatics data standards will be replaced by a chaotic world of incompatible web service standards" [23]. It would be difficult to argue that those words were not prophetic! In an attempt to enhance interoperability between these resources post facto, independent projects began using semantics to help map between the data elements and representations used by each resource. These "Semantic Web service" initiatives themselves, however, took various approaches in their utilization of semantics.
Preceding both Semantic Web technologies and the widespread emergence of Web services in bioinformatics, TAMBIS [24] was a mediator system in which wrappers containing resource-specific queries were mapped to an overarching ontology of bioinformatics concepts. Thus the semantics of TAMBIS is separate from the individual resource interfaces, and the semantic layer acts to re-write multi-concept queries such that individual components of that query are executed by one or more resource-specific wrappers.
myGrid [25] used an extensive bioinformatics domain ontology to annotate traditional bioinformatics Web services within a formal model called "Feta" [14], designed primarily to enhance service discovery, rather than automate multi-service composition. Feta, thus, adds semantics to traditional Web services at the level of its own annotation of a service interface.
OWL-S [12] seeks to improve Web service interoperability by providing a standard OWL ontology for the description of Web services. OWL-S goes beyond the capabilities of WSDL in the sense that it aims to describe the effects of web services on the real world (e.g. adding a charge to a credit card). OWL-S describes the actions of a Web service in a similar manner to how the actions of an agent are described in the planning domain of AI. Each service has a set of pre-conditions and post-conditions which are expressed as boolean formulas over a set of state variables. OWL-S is complex and is under ongoing development.
SAWSDL (Semantic Annotations for WSDL) [13] is an extension to WSDL that attempts to bridge the gap between the world of syntactically described Web services and semantically described Web services. SAWSDL allows a service provider to "tag" parts of a WSDL service description with semantic annotations. These annotations either specify how to translate an XML schema element to/from an ontology instance in another language such as RDF (via the liftingSchemaMapping and loweringSchemaMapping attributes), or indicate that an XML element corresponds to a certain class in an ontology (via the modelReference attribute).
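To make the three SAWSDL attributes named above concrete, the following minimal Python sketch parses a small schema fragment and reads its annotations. The element name, the ontology URI, and the mapping URLs are all hypothetical and invented for illustration; only the SAWSDL namespace and the three attribute names are taken from the specification.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
SAWSDL_NS = "http://www.w3.org/ns/sawsdl"

# Hypothetical schema fragment: the element name and every URI are illustrative only.
SCHEMA = f"""
<xs:schema xmlns:xs="{XSD_NS}" xmlns:sawsdl="{SAWSDL_NS}">
  <xs:element name="ProteinSequence" type="xs:string"
      sawsdl:modelReference="http://example.org/onto#ProteinSequence"
      sawsdl:liftingSchemaMapping="http://example.org/map/seq2rdf.xslt"
      sawsdl:loweringSchemaMapping="http://example.org/map/rdf2seq.rq"/>
</xs:schema>
"""

SAWSDL_ATTRIBUTES = ("modelReference", "liftingSchemaMapping", "loweringSchemaMapping")


def read_annotations(schema_text):
    """Return {element name: {attribute: value}} for SAWSDL-annotated elements."""
    root = ET.fromstring(schema_text)
    found = {}
    for element in root.iter(f"{{{XSD_NS}}}element"):
        annotations = {}
        for attr in SAWSDL_ATTRIBUTES:
            # Namespaced attributes are addressed as "{namespace}localname".
            value = element.get(f"{{{SAWSDL_NS}}}{attr}")
            if value is not None:
                annotations[attr] = value
        if annotations:
            found[element.get("name")] = annotations
    return found


annotations = read_annotations(SCHEMA)
```

Note that the sketch only retrieves the annotation values; what a client should do with a lifting or lowering mapping once it has the URL is exactly the part that remains a matter of community agreement.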
WSMO (Web service Modeling Ontology) [26] is a research project that has the same general goals as OWL-S. In contrast to OWL-S, WSMO uses its own modeling language, WSML (Web service Modeling Language) for encoding Web service descriptions. One advantage of WSML over OWL-S is that it has built-in syntax for encoding the boolean formulas that are used to describe the pre-conditions and post-conditions of the services. In contrast, OWL-S employs a more ad hoc approach where the formulas are encoded as XML literals or string literals in an external syntax such as PDDL [27].
caBIO (part of caCORE [28]) designed a traditional Web service API describing all "valid" operations for a given set of biological objects. Within the XML sent-to or received-from caBIO services are semantic annotations compliant with a (vast) domain vocabulary. Thus the semantics of caBIO data are contained in the values of XML elements, and the "meaning" of those XML elements themselves is defined by the caBIO API.
BioMoby [11] carries its semantics in the data-structures themselves, and unlike caBIO, does not constrain what operations can be done on any given biological object. BioMoby requires service providers to utilize a common, end-user-extensible ontology of biological data-types, and to consume and produce XML serializations of instances of that ontology. The BioMoby ontology is both hierarchical and partitive; thus the element name at any given position in the resulting XML serialization, and its child-element structure, can change without changing the semantics of the data. This enhances interoperability because (a) the semantics of the data are self-describing and embedded in the data, and (b) complex messages can be utilized by simpler services by simply paying attention to those data-components that they understand. As a result, assembly of BioMoby Web services can be fully automated since the "meaning" of any given data message can be reliably interpreted by the recipient without the need of mediators. Unfortunately, this flexibility in the XML representation of the data precludes the ability to use XML Schema to describe the syntax of the message, and thus traditional Web service tools are of limited utility. Moreover, BioMoby's XML serialization is non-standard and only understood by other BioMoby services, hampering interoperability outside of the project.
SSWAP [29] also carries the semantics of the data in the message itself; however, it utilizes Semantic Web standards to do so. SSWAP defines a shared, lightweight OWL model of a service interface, where RDF-XML instances of this model are used as both the interface definition and as the "container" of the input and output data during service invocation. Because OWL-RDF cannot (reliably) be described in XML Schema, and because SSWAP includes the service interface model as part of its required messaging "scaffold", SSWAP is also incompatible with traditional Web services toolkits and requires project-specific tooling, but exhibits significant interoperability and automatability with other SSWAP services.
Though some of these approaches might still be considered "emergent", even the more mature ones are not in widespread use outside of their own communities. Moreover, each approach attempted to inject semantics at a different position within the normal Web Services paradigm, making many of these Semantic Web service approaches incompatible with one another.
To justify our creation of (yet) another approach to Semantic Web service provision, we must discuss both published and subjective observations of Web service functionality, and pinpoint areas that continue to be problematic with respect to either service discoverability, or service interoperability. Clearly, if we cannot demonstrate the potential for a significant improvement over the status quo, service providers will have no motivation to adopt this approach, and the project will fail. Here, then, are the core observations that compel us to attempt a novel strategy.
First, we, and others [14, 30], noted that Web services in bioinformatics (and other scientific domains) exhibit only a small subset of the full range of complex behaviours that service-oriented Architectures allow. With few exceptions, bioinformatics Web services are independent, idempotent, stateless, transformative, and atomic. This stands in stark contrast to Web service solutions to, for example, the ticket-ordering use-case that is commonly discussed in this domain. Almost invariably, bioinformatics Web services consume a specific input data type, and in a stateless and atomic operation, return related output data type(s) generated by whatever transformation the service executes on that input. That most services are transformative in this way suggests that attempting to declare or model the underlying business-process may be unnecessary in the bioinformatics domain - to quote Goble again, "any integration technology should only be as heavy as it needs to be". Indeed, this observation was made by both the Feta and BioMoby projects [11, 14], though both Feta and BioMoby acknowledged the need for some level of simple service type annotation to assist in discovery.
A second important consequence of the observation that bioinformatics services are transformative has not (to our knowledge) been previously highlighted: the transformation of input to output implies that there is some relationship between that input and output, and this important metadata is not being captured or utilized by any current framework. We believe that these relationships, while not capturing the service's "business process" per se, capture with great accuracy the purpose of the service; moreover, through observations made on the students of training courses in Web service workflow composition, we (subjectively) concluded that these relationships are likely a more accurate reflection of the way our end-users think about these data transformations, versus annotating the algorithmic function as is done in BioMoby and Feta. For example, biologists do not execute a BLAST analysis because they wish to run a sequence similarity matrix over their input data; they execute a BLAST analysis because they are interested in finding sequences that are homologous to their input sequence - they are interested in the homology relationship, not the BLAST algorithm. As such, we believe that capturing these entity-relationships as service annotations is an important criterion for enhancing discovery of relevant services by our target users. This observation led to our second core best-practice: that services add their output to the input node via a meaningful property describing the relationship between input and output, and services may therefore be indexed and discovered based on that property.
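The best-practice just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SADI's implementation: graphs are modelled as plain sets of (subject, predicate, object) tuples rather than real RDF, and the identifiers, the `ex:isHomologousTo` predicate, the lookup table, and the registry are all hypothetical.

```python
# A triple is (subject, predicate, object); a graph is a set of triples.
# All identifiers and the lookup table below are hypothetical stand-ins.
HOMOLOGS = {"seq:A1": ["seq:B2", "seq:C3"]}


def homology_service(input_graph, input_node):
    """Return the input graph plus one new edge per homolog of input_node."""
    output_graph = set(input_graph)
    for hit in HOMOLOGS.get(input_node, []):
        # The output is attached to the *input* node by a meaningful predicate.
        output_graph.add((input_node, "ex:isHomologousTo", hit))
    return output_graph


# A registry can then index services by the predicate they attach.
REGISTRY = {"ex:isHomologousTo": homology_service}


def find_service(predicate):
    """Discover a service by the relationship it provides, not its algorithm."""
    return REGISTRY.get(predicate)


input_graph = {("seq:A1", "rdf:type", "ex:ProteinSequence")}
service = find_service("ex:isHomologousTo")
result = service(input_graph, "seq:A1")
```

The client in this sketch never asks for "BLAST"; it asks for whatever service can supply the homology relationship, which is the discovery behaviour the paragraph above argues for.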
Our third observation was twofold. On one hand, we noticed a general sense of disdain, bordering on frustration, within much of the bioinformatics community with respect to the SOAP protocol in general, and the incompatibilities between various language and platform-specific implementations of SOAP. With the distinct exception of the National Cancer Institute's caBIO framework, bioinformatics resources only rarely implement SOAP interfaces that utilize the Object-oriented style that SOAP allows, and even fewer take advantage of the rich features of the SOAP envelope such as intermediaries and message paths. Other than caBIO, almost all bioinformatics Web interfaces are straightforward, single-operation request/response. For example, the SOAP interface of TogoWS [31] provides a KeggGetEnzymesByPathway function that consumes a KEGG pathway identifier and responds with a list of related Enzymes. For these kinds of services, the overhead of SOAP is (demonstrably) unnecessary, so we feel it would be preferable to avoid SOAP entirely. On the other hand, there is an increasingly positive attitude in our community towards "RESTful" architectures [32]. It is worth taking a moment to dissect this goodwill, however, since it is in our opinion slightly misplaced. Few, if any, bioinformatics interfaces that claim to be RESTful are truly following a REST architecture. To be RESTful, all entities would be named resources whose states are manipulated through a limited number of methods. This is not a trivial architecture to achieve in practice, and most importantly is not, in any way, the same as declaring that all parameters for all functions should be part of a URL. Such interfaces (i.e. the vast majority of "RESTful" interfaces in bioinformatics) would better be described as CGI GET-based interfaces. 
For example, the "REST" interface of PhyloWS [33] consumes a specially-formatted query URL including a clade identifier and other key/value parameters, and returns a phylogenetic subtree. There is no identifiable resource whose state is being manipulated by that operation, and while it might be argued that every conceivable query is its own GET-able resource, such an argument would be a contrived interpretation of REST philosophy. We therefore believe that the bioinformatics community's goodwill is directed at interfaces that limit themselves to the "pure" HTTP protocol, rather than at REST per se. Accordingly, we decided to utilize straightforward HTTP GET and POST for SADI, relying heavily on standard HTTP response codes for special cases, though we do not claim SADI to be "RESTful".
Fourth, after observing the barriers to up-take of both BioMoby and SSWAP, it became clear that project- or protocol-specific message scaffolding should be avoided. As such, the SADI recommendation is to pass data only, with no scaffolding whatsoever.
Finally, we made a subjective evaluation of the cause of failure in (most) precedent interoperability architectures, and concluded that, in our opinion, XML Schema is the problem and should be abandoned. To briefly justify this conclusion, we observe the following: XML Schema has been described as "far and away the most complex data model ever proposed" and "seriously flawed" [34]. Add to this complexity the number of different aspects of our target domain that need to be represented (Strömbäck et al. found 85 different schemas within the sub-domain of systems biology alone [35]), and there is immediately a requirement for either schema standardization, or schema mapping to facilitate interoperability. Schema standardization is "prohibitively time-consuming" [36], and though there have been numerous attempts to automate schema mapping - that is, the ability for two schemas to exchange data, as would be required to automate the interaction between arbitrary Web services - none have proven reliable in an open-Web situation [37]. Automated schema mapping is likely an AI-complete problem since it requires the mapping of arbitrarily chosen natural-language labels (XML tags) to one another based on the semantics of either the tag or its child-content. As such, schema mapping approaches are unlikely to yield an acceptable result in the foreseeable future. This barrier has had significant and destructive consequences beyond the obvious thwarting of interoperability. The inability to automatically map between schemas has resulted, counter-intuitively, in an increase in the complexity of Web service interfaces. Since it is extremely difficult to pipeline traditional Web services together reliably, there is little point in making their operations highly granular; it is more "efficient" to simply execute the entire service operation as a single function-call.
This, in turn, increases the complexity of the input and output messages [38], making schema mapping even more difficult. Our final observation is that there is considerable early adoption of Semantic Web technologies in the life sciences, with several significant organizations already publishing their data in RDF format (e.g. UniProt [39]). If we continue using XML Schema-based services, we may soon find ourselves mapping semantically rich data back into semantically impoverished XML in order to analyse it (this is, in fact, the purpose of the SAWSDL specification!). This would defeat the purpose of utilizing Semantic Web technologies in the first place. Clearly, more is gained by natively taking advantage of the enhanced interoperability inherent in RDF representations of data than is gained through trying to support legacy Schema-based interfaces. For all of these reasons, we utilize RDF/OWL as both our interface description and messaging layer, and require it for all SADI-compliant interfaces. Moreover, we suggest that our community's continued adherence to traditional Schema-based Web service specifications will, at best, be destructive to their attempts to be interoperable. To quote Lincoln Stein, "to achieve seamless interoperability among online databases, data providers must change their ways" [23].
SADI and the Linked Data movement
The behaviour of SADI is consistent with, and in fact furthers the goals of, the Linked Data [40] community. Consider, for example, what happens in a SADI service workflow, such as those automatically generated by the SHARE client. Input data is passed to a service, and comes back with output data attached. That output data may be utilized as input to a subsequent service, and so on. As the data flows through that workflow, a rich Linked Data graph is being constructed where every input is semantically linked to every associated output. This graph of dynamically generated data can be integrated with traditional static Linked Data resources, and queried or explored using standard Linked Data toolkits.
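The way a workflow accumulates a linked graph can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the two "services" are local functions backed by hypothetical lookup tables, the predicates and identifiers are invented, and triples are plain tuples rather than RDF.

```python
# Hypothetical lookup tables standing in for two remote services.
GENE_TO_PROTEIN = {"gene:G1": "prot:P1"}
PROTEIN_TO_PATHWAY = {"prot:P1": "path:W1"}


def encodes_service(node):
    """Return the edges this service would attach to the given input node."""
    target = GENE_TO_PROTEIN.get(node)
    return {(node, "ex:encodes", target)} if target else set()


def pathway_service(node):
    """Return the edges this service would attach to the given input node."""
    target = PROTEIN_TO_PATHWAY.get(node)
    return {(node, "ex:participatesIn", target)} if target else set()


def run_workflow(start_node, services):
    """Pass each service's outputs to the next, accumulating every edge."""
    graph = set()
    frontier = {start_node}
    for service in services:
        new_edges = set()
        for node in frontier:
            new_edges |= service(node)
        graph |= new_edges
        # Outputs of this step become the inputs of the next step.
        frontier = {obj for (_, _, obj) in new_edges}
    return graph


graph = run_workflow("gene:G1", [encodes_service, pathway_service])
```

After the run, the graph holds a path from the starting gene through the protein to the pathway, so the provenance of each output is recorded as an ordinary edge that any graph tool could traverse.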
SADI and the Semantic Web
SADI merges the domains of Web services and the Semantic Web in a novel way. Every service generates one or more "edges" on an RDF graph, where the edge that will be generated is defined as a property restriction in an OWL ontology. Therefore, in SADI, OWL property restrictions "represent" potential services, and therefore SADI can be used to generate instances of OWL classes through service discovery based on these property restrictions. OWL, effectively, becomes an abstract workflow language. Moreover, any OWL document - whether created for this purpose or not - can be used by SADI-enabled software to retrieve instance data, so long as SADI services exist that map to the properties used in the ontology. Thus SADI is able to take advantage of any Semantic Web ontology.
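The idea that property restrictions "represent" potential services can be sketched as follows. This is a minimal illustration, not the SADI registry API or a real OWL reasoner: a class is reduced to the list of properties its members must have, and the class names, predicates, and service names are all hypothetical.

```python
# Hypothetical registry: predicate -> name of a service that generates that edge.
REGISTRY = {
    "ex:participatesIn": "proteinToPathwayService",
    "ex:isHomologousTo": "homologySearchService",
}

# Classes described only by the properties their members must have,
# standing in for OWL property restrictions.
CLASS_RESTRICTIONS = {
    "ex:AnnotatedProtein": ["ex:participatesIn", "ex:isHomologousTo"],
    "ex:CrystallizedProtein": ["ex:participatesIn", "ex:hasStructure"],
}


def plan_for_class(class_name):
    """Pair each restricted property with a service, or None if none is registered."""
    return [(prop, REGISTRY.get(prop)) for prop in CLASS_RESTRICTIONS[class_name]]


def is_resolvable(class_name):
    """True when every restricted property maps to a known service."""
    return all(service is not None for _, service in plan_for_class(class_name))


plan = plan_for_class("ex:AnnotatedProtein")
```

Read this way, the class definition doubles as an abstract workflow: each restriction names an edge that must exist, and the registry says which service can create it. A class such as the hypothetical `ex:CrystallizedProtein` cannot be populated until a service for `ex:hasStructure` is registered.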
Finally, while the bioinformatics community continues to utilize large, complex, semantically opaque flat-files, we believe that SADI (and the Semantic Web in general) starts to provide greater impetus to break out the semantics of these files and increase the granularity of both data and services in the bioinformatics space. While SADI does not dictate the nature of the input and output data, it would be somewhat absurd for a SADI BLAST service to output a BLAST flat-file linked to its input sequence by a (nonsensical) "hasBLASTReport" property. Instead, the Linked-Data Web that SADI services build makes it much more useful to output a parsed BLAST report, where each "hit" is linked to the original input sequence through some form of "sharesSimilarityTo" relationship. Thus, by challenging service providers to make their services discoverable through a biological relationship, rather than an algorithmic one, we believe SADI will provide the incentive to move beyond semantically opaque text reports and start explicitly encoding the semantics contained in those documents, resulting in a much richer data ecosystem.
SADI and other emergent Semantic Web service standards
As noted above, several of the existing Semantic Web service approaches are relatively new, and may still experience widespread adoption. Among these, the SAWSDL specification seems to be gaining considerable traction, though for reasons discussed earlier, we have some concerns about the utility of this standard in an RDF-based world, and about the lack of rigour in the standard itself. Description of SADI services using the SAWSDL standard is trivial, but not particularly useful. SAWSDL enhances traditional WSDL documents by indicating a semantic type for the service's input and output XML elements, and indicates a "lifting" or "lowering" schema to guide the transformation of RDF data into XML and back again. In SADI, the semantic types are simply the OWL Classes that the service provider declares as their input and output. Moreover, because the service natively consumes RDF there is no need for a lifting or lowering schema (or at worst, the lifting and lowering is an identity transformation). Nevertheless, since the SAWSDL specification gives no guidance as to the format of these lifting and lowering schemas, or how to interpret them, and since OWL Individuals cannot reliably be described using XML Schema, there will need to be an additional level of as-yet non-standardized community agreement before SAWSDL services (SADI or otherwise) could expect to be interoperable. Moreover, the myGrid/Moby service ontology contains far more detailed annotation than a SAWSDL document, and these detailed annotations are useful for service discovery as well as service maintenance and testing. As such, while SADI is superficially compatible with the SAWSDL standard, we find the standard itself lacking for our purposes.
Limitations of SADI
SADI suffers from the same limitations that pose barriers to other Web service and Semantic Web projects [41]. As an interoperability system, the utility of SADI is entirely dependent on the number of providers who adopt its conventions. We recognize that there is extensive tooling support for traditional Web services and there is a perceived simplicity of XML compared to RDF/OWL. Moreover, there are thousands of legacy bioinformatics Web services that are not interoperable (neither with each other, nor with SADI services), and thus there would appear to be little benefit to becoming an early-adopter of SADI. To counter this, we have created software libraries that partially automate the process of service construction in both Perl and Java. Similar to the "Dashboard" application for BioMoby[42], a plug-in has been created that integrates a SADI service development environment into the Protégé [43] ontology editing application, where the user designs the ontologies describing their data, the plug-in creates the service scaffold, and the provider adds their business logic, setting the values of "stubs" provided by the service scaffold. This automation is possible because the behaviors of SADI services are predictable, and thus the code for SADI services is similarly consistent and predictable. In addition, we believe that the SAWSDL specification, together with XML transformations, will allow us to build semi-automated "wrappers" around traditional Web services that will make them SADI-compliant (at the expense of a loss in semantic richness versus creating a native SADI service). In this way, we hope to bootstrap the SADI project by first simplifying the task of service provision, and then by creating a core set of interoperable services that these Providers can link into. 
At the time of writing, there are more than 400 bioinformatics and chemoinformatics services available in the SADI registry[44], and several hundred more will be published by our team of collaborators by the end of this year.
The reliance of SADI on the Semantic Web also exposes limitations. In particular, success of the SADI architecture (like the success of the Semantic Web itself) will largely depend on widespread re-use of publicly-available and well-defined ontological predicates, and the definition of service inputs in terms of OWL restrictions on these properties. Unfortunately, the majority of focus in the Semantic Web efforts of the health-care and life science community thus far has been on defining classes, rather than predicates; asserting class-hierarchies without formally defining what properties a member of that class is expected to have, or what distinguishes members of one class from another. We hope, however, that the power we have demonstrated in these prototype implementations provides a sufficiently compelling argument to initiate the evolution of a slightly higher level of Semantic Web complexity in the health-care and life-sciences space.