The need to create new services in order to answer typical biological use cases speaks to two general problems in the bioinformatics Web service community: (1) the insufficiency of current offerings by data and tool providers, and (2) the difficulty data and service providers face in predicting what services their end-users need. While solving common bioinformatics use cases, participants discovered that many straightforward yet non-obvious operations were missing: for example, given a set of gene names, return a phylogenetic profile; or, given a set of BLAST hits, group them according to other shared features such as annotations or motifs. Clearly, if a researcher is constructing a workflow and the next required function is not available as a Web service, or is embedded inside a more complex Web service, then the workflow construction reaches a dead end. This was observed repeatedly during the BioHackathon.
To resolve the pathway and glycoinformatics use cases, participants found that several services needed to be written de novo, partly because the required services did not exist, and partly because of incompatibilities between Java and Perl Web services arising from their respective SOAP libraries. Importantly, it was noted that most of the workflow was dedicated to file-format conversions (referred to as "shims" in the workflow community), which in some cases forced the workflows to be built at the code level rather than in a workflow design tool. Some conversions can be automated by newly developed Web services, for example using TogoWS's URL API to convert between a number of commonly used flat-file formats or to translate any flat file to JSON or XML (and, in the future, to RDF). Nevertheless, such conversions remain a nuisance for biologist end-users, and it is suspected that even with "shim" Web services available, a non-trivial amount of format-conversion coding is inevitable as new technologies (and therefore new data formats) appear at an increasing pace. It was also noted that the non-standard output format of the BLAST search in the PDBj sequence navigator caused additional problems. Taken together, these nuisances provided a strong lesson for data providers: where possible, adhere rigorously to standard formats for which tools already exist.
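To make the "shim" problem concrete, the sketch below shows the kind of trivial flat-file-to-JSON conversion that consumed much of the workflow-building effort. It is a deliberately simplified illustration (the function name, the minimal FASTA parsing, and the JSON record structure are our own, not part of TogoWS or any service mentioned above):

```python
import json

def fasta_to_json(fasta_text):
    """Convert a (simplified) multi-record FASTA string into a JSON
    document -- the kind of trivial 'shim' step that dominates many
    bioinformatics workflows."""
    records = []
    header, seq_lines = None, []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append({"id": header, "sequence": "".join(seq_lines)})
            # keep only the identifier token of the header line
            header, seq_lines = line[1:].split()[0], []
        else:
            seq_lines.append(line.strip())
    if header is not None:
        records.append({"id": header, "sequence": "".join(seq_lines)})
    return json.dumps(records)
```

Each such conversion is conceptually trivial, yet when a workflow chains a dozen tools it may need a dozen of these, which is why exposing them as reusable Web services (rather than re-coding them per workflow) matters.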
One might speculate that at least part of the granularity problem in bioinformatics stems from the fact that much data retrieval and analysis involves passing large, complex, opaque, and highly specialized flat files. The proliferation of "shims" in bioinformatics workflows provides additional evidence for this problem. It should perhaps not be surprising, therefore, that the granularity of Web services in bioinformatics generally matches that of the flat files being passed around: there is very little incentive for service providers to offer Web services at any finer granularity than the flat files they understand, despite such finer-grained operations being extremely useful in the context of novel workflows. While this does not account for all cases of "blockage", participants observed while trying to construct workflows for the use cases that it was certainly a root cause for some of them. The corollary to this observation is that highly granular Web services lead to extremely complex workflows, with complexity at a level that may well be beyond the capabilities of end-users. This trade-off between granularity and complexity might be resolved, at least partially, by the inclusion of "semantics" in data-types and Web services, such that workflow synthesis becomes more intuitive.
Given the difficulty of anticipating which Web services might be needed for any given research question, one might doubt whether it will ever be possible for non-programmer end-users to build their analyses entirely in a workflow client. We contend, however, that this problem can be mitigated by adherence to some simple best practices that the participants identified at the BioHackathon and which we outline below.
From these observations, participants have derived several guidelines that they will adopt as they continue to develop their respective tools; it is hoped that these guidelines will also be useful to other groups facing similar problems. In brief, the biological Web service guidelines propose standard specifications for REST and SOAP methods that search for, retrieve, and convert the formats of database entries. They also propose a format for query strings and recommend data types. Finally, they require the preparation of sample code and documentation. The full descriptions of these guidelines are available at the BioHackathon website.
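As an illustration only of the kind of uniform, predictable REST scheme such guidelines advocate (the URL patterns, parameter names, and the example.org host below are invented for this sketch; the guideline documents on the BioHackathon website are authoritative):

```python
# Hypothetical REST URL patterns for entry retrieval and search.
# A provider publishing one predictable scheme like this lets client
# code address every database and format the same way.

BASE = "http://example.org/api"  # placeholder host, not a real service

def entry_url(database, entry_id, fmt="json"):
    """Retrieve-entry pattern: /entry/<db>/<id>.<format>"""
    return "{}/entry/{}/{}.{}".format(BASE, database, entry_id, fmt)

def search_url(database, query, offset=1, limit=20):
    """Search pattern with an explicit, documented query string."""
    return "{}/search/{}?query={}&offset={}&limit={}".format(
        BASE, database, query, offset, limit)
```

The value of such a convention is that a workflow client can construct requests mechanically, without per-service glue code.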
While participants do not propose these guidelines as a formal international standard, bioinformatics resource providers are strongly encouraged to examine these suggestions and follow those relevant to their own resources in order to maximize usability and interoperability. Work to maintain and update these guidelines continues, both within individual provider groups and across the wider BioHackathon community, and participants continue to observe and evaluate new technologies such as the Semantic Web and Semantic Web services.
An important conclusion from these guidelines is that resource providers should be as comprehensive as possible in exposing all data manipulation and analysis resources as Web services: that is, they should attempt to anticipate all the ways users might want to manipulate data records and make each of these functions available as an individual Web service. Resource providers should also attempt to be as fine-grained and modular as possible in the Web services they implement. If a Web service, for example, applies a data transformation to incoming data before executing its analysis, the provider should consider publishing the transformation and the analysis as separate services, as both functions are likely to be independently useful in workflows the provider cannot anticipate.
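A minimal sketch of this decomposition principle, using hypothetical Python service functions (the GC-content analysis is chosen purely for illustration and stands in for any transformation-plus-analysis service):

```python
def normalize_sequence(raw):
    """Transformation service: strip whitespace and any characters
    outside the nucleotide alphabet, and uppercase the result."""
    return "".join(c for c in raw.upper() if c in "ACGTUN")

def gc_content(seq):
    """Analysis service: fraction of G/C bases in a normalized sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_content_service(raw):
    """Coarse-grained composite, retained for convenience, built from
    the two fine-grained services above."""
    return gc_content(normalize_sequence(raw))
```

Published as a single opaque `gc_content_service`, neither step can be reused; published separately, the normalizer can feed any downstream analysis and the analysis can accept already-clean data from elsewhere.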
One might argue, however, that this approach makes the situation worse by increasing the complexity of workflows. Fortunately, modern workflow tools such as Taverna allow users to publish sets of connected Web services as individual, reusable workflow modules. Consumers can therefore work at whatever level of granularity they need: either the entire sub-workflow as a single function, or the sub-workflow as a set of highly granular, reusable modules that they can splice together as they see fit.
Motivating arguments for semantics
Discussions about semantics and its role in supporting automated workflow synthesis ran throughout the BioHackathon event, and there was considerable disagreement among the participants regarding the degree to which XML Schema and Schema-based Web services (i.e. WSDL) pose barriers to interoperability and integration, and regarding the feasibility of alternatives.
The most compelling argument in favor of XML Schema is the availability of a wide range of compatible tools; however, comparable tooling is rapidly becoming available for RDF/OWL, so this advantage grows less significant every day. Moreover, as an increasing number of bioinformatics data providers begin publishing their data natively in RDF (e.g. UniProt), it will soon become necessary not only to map XML Schemas to one another to achieve interoperability, but also to map RDF Schema to XML Schema (e.g. using SAWSDL) in order to utilize traditional Web services. Representatives from the SADI project pointed out that this growing trend should, in itself, be sufficient motivation to look at alternatives to standard WSDL-based Web services.
Finally, our reliance on XML Schema has had other unintended consequences that not only thwart interoperability but are making the problem worse over time. In a recent keynote address, Charles Petrie observed that "there are no practical Web services!". What he meant is that problems of data formatting, lack of annotation, lack of semantics, and the resulting complexity of workflow construction have all led to a situation in which Web services are seldom modeled as modular units that can be "cobbled together" in various ways to achieve a variety of ends; rather, they are more often large, multi-functional units with very low granularity (and therefore low re-usability in novel workflows). While Petrie was describing primarily business-oriented Web services, the situation in bioinformatics is not entirely dissimilar, and it is precisely this scarcity of fine-grained services that the BioHackathon participants identified. The movement to RDF/OWL-based data representation will, we propose, naturally lead to finer-grained data models and APIs, simply because there is no advantage whatsoever in using RDF/OWL to pass around complex, opaque documents. This, in itself, should provide the incentive to break these documents into richer, more granular structures, which in turn will lead to the creation of more fine-grained APIs that can operate over them.
Increasing the granularity of the workflow almost invariably increases the granularity of data-types (e.g. serving only a small fragment of a PDB record, rather than the entire PDB record). The problem of data-type granularity has long been of interest to the BioMoby project. The data-type ontology in BioMoby was specifically designed to offer a much higher level of granularity in data representation than the typical bioinformatics flat-file, and it is this enhanced granularity that has enabled the creation of tools like Daggoo that can semi-automatically wrap existing Web resources. In practice, however, few data providers utilize the higher granularity of the Moby data-type ontology, preferring to simply consume and produce the same opaque flat-file formats they did before, but with a BioMoby object wrapper. As such, there was a general feeling at the BioHackathon that BioMoby did not offer a useful solution to this particular problem.
The use of RDF/OWL provides very strong incentives for data and service providers to increase the granularity of their data formats because, by doing so, it becomes possible to embed machine-readable meaning into the data. This added layer of meaning can be exploited, for example, by visualization tools, which can enhance the interpretation of the data by automatically selecting an appropriate rendering or representation for a specific piece of data based on its semantic properties; it can also be used by workflow construction tools to automate the conversion of one data format into an equivalent or related format, thus simplifying the "shim" problem. The SADI initiative is already demonstrating the advantages of using RDF/OWL in Web services.
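As a toy illustration of this idea (all URIs, types, and renderers below are invented for the example, and RDF-style triples are held in plain Python tuples rather than a real triple store), declared semantic types let a generic tool choose its behaviour from the data's meaning rather than from its file format:

```python
# Each piece of data carries an explicit type assertion, so a generic
# visualization tool can dispatch on meaning, not on file extension.

TYPE = "rdf:type"
triples = [
    ("ex:record1", TYPE, "ex:ProteinSequence"),
    ("ex:record1", "ex:value", "MKTAYIAKQR"),
    ("ex:record2", TYPE, "ex:GeneName"),
    ("ex:record2", "ex:value", "BRCA1"),
]

# One renderer per declared semantic type (illustrative only).
RENDERERS = {
    "ex:ProteinSequence": lambda v: "sequence view: " + v,
    "ex:GeneName": lambda v: "gene link: " + v,
}

def render(subject, triples):
    """Select a rendering for `subject` based on its declared type."""
    rdf_type = next(o for s, p, o in triples if s == subject and p == TYPE)
    value = next(o for s, p, o in triples if s == subject and p == "ex:value")
    return RENDERERS[rdf_type](value)
```

With an opaque flat file, the tool would have to guess from the format; with typed triples, the same mechanism could equally drive shim selection in a workflow builder.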