Monday, February 16, 2009

Neptune Database Services and Scripting:

The use of services on the net raises an interesting set of issues with respect to acting as generic data resources.  Doing mashups with RSS or Atom feeds, weather, or other somewhat periodic and serial data has become rather common.  Yahoo Pipes (Ref: http://pipes.yahoo.com ) and other such web applications do a good job dealing with these types of data endpoints.

However, when working with scientific data sets and services like those at CHRONOS (Ref: http://portal.chronos.org/gridsphere/gridsphere?cid=res_dev ), a different set of usage issues comes up.

Let's look at a basic XML-RPC styled service endpoint for the Neptune database at CHRONOS (Ref: http://www.chronos.org ):

http://services.chronos.org/xqe/public/chronos.neptune.samples.advanced?callback=displayNexus&time_range=33-34&fossil_group=Diatoms&include_dated=true&serializeAs=wrs&

Here I have used a service from the XQE engine, setting the time range to 33-34 Ma and the fossil group to Diatoms, indicating that only dated items should be used, and setting the output serialization to web row sets (generic XML).
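
As a rough sketch, a single call like this can be consumed in just a few lines of Groovy; the data.currentRow path here is the same web row set structure navigated by the longer script further down, and the fetch-and-parse idiom is only one of several ways to do it:

def url = "http://services.chronos.org/xqe/public/chronos.neptune.samples.advanced?callback=displayNexus&time_range=33-34&fossil_group=Diatoms&include_dated=true&serializeAs=wrs&"

// Groovy adds a text property to java.net.URL, so one GET plus one parse is enough here.
def wrs = new XmlSlurper().parseText(new URL(url).text)

// Each currentRow element in the returned web row set is one dated sample record.
println "Rows returned for Diatoms, 33-34 Ma: ${wrs.data.currentRow.size()}"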

However, suppose, to use a recent example, we wanted to locate all the unique taxon names and drill holes for various time ranges and fossil groups in the CHRONOS Neptune database.  We now need to loop over a set of time ranges and fossil names and then extract the unique, rather than full, set of names and locations.  We can do this with the service above, but not directly: we need to collect and process a range of calls to that service.

One could develop a small program in Java, Ruby, Groovy, or some other language.  Dynamic languages like Groovy are especially well suited for such efforts, and a simple script like the following can do the job:
// Age boundaries (Ma) used for the time ranges below:
//Upper Oligocene   Sub-Series/Sub-Epoch   23.8    28.5    Oligocene
//Lower Oligocene   Sub-Series/Sub-Epoch   28.5    33.7    Oligocene
//Upper Eocene      Sub-Series/Sub-Epoch   33.7    36.9    Eocene
//Middle Eocene     Sub-Series/Sub-Epoch   36.9    49      Eocene
//Lower Eocene      Sub-Series/Sub-Epoch   49      55.05   Eocene
//Paleocene         Series/Epoch           55.05   65      Paleogene

import org.apache.commons.httpclient.HttpClient
import org.apache.commons.httpclient.methods.GetMethod

def ageRanges = ['23.8-28.5', '28.5-33.7', '33.7-36.9', '36.9-49', '49-55.05', '55.05-65']
def fossilTypes = ['Diatoms', 'Planktonic+Foraminifera', 'Radiolarians', 'Calcareous+Nannoplankton']

fossilTypes.each { type ->
    ageRanges.each { range ->
        // Build the XQE service call for this fossil group and time range,
        // asking for a web row set (WRS) serialization.
        def urlxml = "http://services.chronos.org/xqe/public/chronos.neptune.samples.advanced?callback=displayNexus&time_range=${range}&fossil_group=${type}&include_synonyms=true&include_dated=true&serializeAs=wrs&"

        def client = new HttpClient()
        def get = new GetMethod(urlxml)
        client.executeMethod(get)

        def xmlSlurped = null
        try {
            xmlSlurped = new XmlSlurper().parseText(get.getResponseBodyAsString())

            def entries = xmlSlurped.data.currentRow
            def legSiteHole = []
            def bug = []
            entries.each { entry ->
                // Columns 3-5 hold leg, site and hole; column 11 holds the taxon name.
                legSiteHole += "${entry.columnValue[3]}_${entry.columnValue[4]}_${entry.columnValue[5]}"
                bug += "${entry.columnValue[11]}"
            }

            println "Total tax count is ${bug.size()} and total LSH count is ${legSiteHole.size()}"
            println "In range " + range + " of type " + type + " found " + bug.unique().size() +
                    " unique taxa in " + legSiteHole.unique().size() + " unique leg/site/holes"
            println range + " " + bug.unique().size()
        }
        catch (org.xml.sax.SAXParseException e) {
            println "Exception parsing XML"
        }
    }
}
This script uses the Apache Commons HttpClient, but there are many ways to do this.  The point I am after is not to show a best practice in using Groovy to access services (I'm sure there are many aspects of that code people could comment on).  Rather, it is to highlight an issue related to balancing service generality and usefulness.

Note that I did attempt to use both the XML and the character-separated-values versions via Yahoo Pipes and also the now-defunct Google Mashup Editor.  However, the need for a more programmatic interface to conduct loops, basic array manipulation, and such is beyond what these can do easily.  While it might be possible, I did not find it easy or intuitive.  These resources seem more focused on RSS or Atom sources and simple filtering and counting.

Additionally, the Kepler workflow application is a bit of overkill and doesn't yet seem to work with REST or XML-RPC style services as well as I feel it should.  A heavy focus on WS* and other more heavy-duty scientific data flow operations means its capacity for lightweight, rapid service mashups is limited (IMHO).

Making a resource or service of general "popular" use may make it of generic interest, but it can limit its utility on its own for addressing specific scientific efforts.

Expecting generic services to be usable via packages like Yahoo Pipes or Kepler runs into the implementation aspects of those tools, aspects which may make using them sufficiently complex as to discourage users.

However, expecting people to write code like the example above, even in a language like Groovy that makes calling services and parsing the data rather easy, is also potentially unrealistic.  Not everyone likes to write code.

So there is a dilemma for providers of science data services like CHRONOS.  Make the services too generic and you risk them being of little use to focused research communities.  Address the vertical needs of a specific community or tool interface and you risk making them of limited utility to others.  Additionally, addressing the needs of several vertical communities could be taxing from a human resource point of view.

One does wonder whether the creation of a domain-specific language (DSL) might have benefit in addressing this.  If there were enough community interest to justify its creation, a service that consumes and processes a DSL would allow a sort of "custom command line interface".
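
To give a flavor of the idea, here is a purely hypothetical Groovy sketch; the class and method names (NeptuneQuery, neptune, fossilGroup, timeRange, datedOnly) are invented for illustration and do not exist in any CHRONOS service, though the URL it assembles is the same endpoint used above:

// Hypothetical only: a tiny query DSL that builds a call to the existing
// chronos.neptune.samples.advanced endpoint.
class NeptuneQuery {
    def params = [callback: 'displayNexus', serializeAs: 'wrs']
    void fossilGroup(String g) { params.fossil_group = g }
    void timeRange(String r)   { params.time_range = r }
    void datedOnly()           { params.include_dated = 'true' }
    String toUrl() {
        'http://services.chronos.org/xqe/public/chronos.neptune.samples.advanced?' +
            params.collect { k, v -> "${k}=${v}" }.join('&')
    }
}

// Run a closure against a NeptuneQuery so its body reads like plain statements.
def neptune(Closure spec) {
    def q = new NeptuneQuery()
    spec.delegate = q
    spec.resolveStrategy = Closure.DELEGATE_FIRST
    spec()
    q
}

// A researcher could then express the earlier example call as:
def q = neptune {
    fossilGroup 'Diatoms'
    timeRange   '33-34'
    datedOnly()
}
println q.toUrl()

The same few lines of DSL text could just as easily be posted to a server-side service that interprets them and returns the processed result, which is where the "custom command line interface" notion comes in.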

The Sloan Digital Sky Survey SkyServer (Ref: http://cas.sdss.org/astrodr7/en/ ) addresses this by simply using SQL as that "DSL".  The page http://cas.sdss.org/astrodr7/en/tools/search/sql.asp allows one to structure and submit SQL directly to the database.

However, a more focused DSL might be able to address the needs of particular research groups while consuming more generic service endpoints, endpoints that could also be exposed for others to use in approaches similar to the code above, via the DSL, or via their own approach.

Also, the development effort of making such a DSL might be spread across multiple projects, making it more appealing from a human resource point of view.

Just some ramblings on how to balance generic services (resources) against the needs of special communities for more focused and unique services, regardless of whether they are REST, XML-RPC, or WS* in nature.

Other references:
Kepler Workflow Application: http://kepler-project.org/
Google Mashup Editor: http://code.google.com/gme/ (defunct)
XPROC: http://xproc.org/ and http://www.w3.org/TR/xproc/
GRDDL: http://www.w3.org/2004/01/rdxh/spec
DSL in Groovy:  http://docs.codehaus.org/display/GROOVY/Writing+Domain-Specific+Languages
