Friday, May 01, 2009

Varnish Reverse Proxy Accelerator (Ca...

The need to accelerate the performance of web applications arguably becomes even more relevant as we move into a more linked data approach.  With the possible need to offer multiple representations of a single resource URI to address needs for XML, RDF or XHTML+RDFa (or microformats) etc., the need to improve performance and serve requests from cache when possible increases.

Past efforts in this area have involved Apache connectors to Tomcat (Ref: ) and also reviewing Glassfish.  Glassfish Portfolio (Ref: ) has many elements focused on performance, and there is also Glassfish Grizzly (Ref: ), which offers some impressive-sounding performance aspects using NIO.

However, to address the needs of maintaining URI persistence and caching, what was really needed was a "reverse-proxy accelerator".  Given that set of criteria I soon arrived at Varnish (Ref: ).  The target application is hosted on Glassfish, but obviously Varnish can reverse proxy for a number of machines and any backend talking over HTTP.  Indeed, it's this ability that is important to me: I want to be able to alter systems and networking aspects behind Varnish and have it maintain my URIs in a "Cool URIs" style/approach (Ref: ).

Following the nice tutorial by Jay Kuri located on that site, it is rather easy to get Varnish up and running on Linux doing reverse proxying and caching.  I would like to highlight a few things Jay mentions in that write-up and detail a few aspects I ran into.
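
As a minimal sketch of the starting point (the host and port here are placeholders, not my actual setup), pointing Varnish at a single backend takes just a few lines of VCL:

```
# Tell Varnish where the origin application server lives.
# Varnish listens on its own port and forwards cache misses here.
backend default {
    .host = "localhost";
    .port = "8080";
}
```

With only this in place Varnish will already proxy requests; the caching behavior is then tuned in the VCL subroutines discussed below.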

One of the biggest issues Jay mentions is that Varnish does not want to cache anything that has a cookie reference.  Since almost all web-based applications set cookies as a simple session management approach (even if you are not doing accounts), Varnish by default will not cache your URIs.  Jay goes on to recommend code like:

    if (obj.http.Cache-Control ~ "max-age") {
        unset obj.http.Set-Cookie;
    }

to override this behavior and respond from cache for the content.  As noted, this does mean it starts to become dependent on the content provider/web app developer to issue the headers that inform a cache system like Varnish about the behavior we expect from it.
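
For context, that override lives in the vcl_fetch subroutine, which runs after Varnish fetches an object from the backend.  A sketch in the Varnish 2.x VCL dialect (adapt to your version):

```
sub vcl_fetch {
    # Only cache objects the backend explicitly marked cacheable
    if (obj.http.Cache-Control ~ "max-age") {
        # Drop the session cookie so Varnish will store the object
        unset obj.http.Set-Cookie;
        deliver;
    }
}
```

The effect is that only responses carrying a max-age directive are stripped of their Set-Cookie header and cached; everything else falls through to the default (uncached) behavior.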

Setting behavior in Grails:
Using Grails (Ref: ) it's easy to set the format of our response via the withFormat{} syntax.  Note we would want to make the appropriate entries in the grails.mime.types section of our Config.groovy (Ref: the "content negotiation" section of ).

So something like:

    withFormat {
      html {
        response.setHeader("Vary", "Accept")
        def nowPlusHour = new Date().time + 3600000
        response.setHeader("Last-Modified",
            String.format('%ta, %<te %<tb %<tY %<tH:%<tM:%<tS %<tZ', new Date()))
        response.setHeader("Expires",
            String.format('%ta, %<te %<tb %<tY %<tH:%<tM:%<tS %<tZ',
                new Date(nowPlusHour)))

        [allSites: allSites, allAutoSites: onlyAutoSites]
      }
      rdf {
        // argument to findAllByLeg elided in the original post
        def data = modelAsRDFService.asRDF(AgeModel.findAllByLeg( ))

        response.setHeader("Vary", "Accept")
        response.contentType = "application/rdf+xml"
        def nowPlusHour = new Date().time + 3600000
        response.setHeader("Last-Modified",
            String.format('%ta, %<te %<tb %<tY %<tH:%<tM:%<tS %<tZ', new Date()))
        response.setHeader("Expires",
            String.format('%ta, %<te %<tb %<tY %<tH:%<tM:%<tS %<tZ',
                new Date(nowPlusHour)))
        response.outputStream << data
      }
    }

In this code we have set the Last-Modified and Expires header entries.  Note in this simple example I have simply pushed the expires time ahead by one hour.  You can set this however you wish, depending on how old you think your resources can be and still be valid.
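
The format string in the code above is the standard java.util.Formatter date syntax, so the same expression can be shown as a self-contained plain Java illustration (not Grails code):

```java
import java.util.Date;

public class HttpDates {
    public static void main(String[] args) {
        Date now = new Date();
        // Push the expiry one hour (3,600,000 ms) past "now"
        Date expires = new Date(now.getTime() + 3600000L);
        // '%ta' is the short day name; '%<te', '%<tb', etc. reuse the
        // same Date argument for day, month, year, and time fields
        String fmt = "%ta, %<te %<tb %<tY %<tH:%<tM:%<tS %<tZ";
        System.out.println("Last-Modified: " + String.format(fmt, now));
        System.out.println("Expires: " + String.format(fmt, expires));
    }
}
```

One caveat: this produces dates in the JVM's local time zone, matching the CDT headers shown in the curl output below, whereas HTTP/1.1 formally prefers GMT dates; Varnish accepted them fine in my testing.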

The Vary header is set to address some linked data best practices.  This informs the client (and caches) that the representation of this resource URI can change based on how we call it.  Here it can be requested as HTML (which is in fact XHTML+RDFa in our case, a whole topic in itself) or as RDF.

There is no builder for RDF, so I simply pass the model to a service that generates and returns it for me.  The Export Plugin (Ref: ) might be worth looking at if you want various other formats to serialize your data to.  I don't set the content type for the HTML in the code above; on review I likely should, though by default it comes back as text/html, so that may be fine.

Be sure to have your RDF contain a reference to itself and its resource URI if you are doing a linked data approach. 

There are also likely other response header elements one could consider here based on the reverse proxy and caching needs.  A review of and your accelerator package may reveal other settings of value for your particular environment.

Using curl to validate behavior:
While I really love Firebug, I found it not the easiest tool for verifying proper cache behavior by Varnish.  Rather, being a bit of a CLI lover, I used curl.  A curl request on a resource URI is made using the -D option to capture the headers.  The headers from two consecutive requests are shown below:

HTTP/1.1 200 OK
X-Powered-By: Servlet/2.5
Server: Sun Java System Application Server 9.1
Vary: Accept
Last-Modified: Fri, 1 May 2009 16:10:00 CDT
Expires: Fri, 1 May 2009 17:10:00 CDT
Content-Type: image/png
Content-Length: 20951
Date: Fri, 01 May 2009 19:59:32 GMT
X-Varnish: 868897910
Age: 0
Via: 1.1 varnish
Connection: keep-alive

HTTP/1.1 200 OK
X-Powered-By: Servlet/2.5
Server: Sun Java System Application Server 9.1
Vary: Accept
Last-Modified: Fri, 1 May 2009 16:10:00 CDT
Expires: Fri, 1 May 2009 17:10:00 CDT
Content-Type: image/png
Content-Length: 20951
Date: Fri, 01 May 2009 19:59:34 GMT
X-Varnish: 868897911 868897910
Age: 2
Via: 1.1 varnish
Connection: keep-alive

Note the two values in the X-Varnish response header of the second response, indicating to us that the content came from the previously cached copy.  If we continually saw only one number here, we would need to investigate why our requests are being passed through to the backend server.
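
That check can be stated mechanically: X-Varnish carries one transaction ID on a cache miss and two (this request's ID plus the ID of the request that stored the object) on a hit.  A small illustrative helper, purely my own sketch and not part of Varnish or curl:

```java
public class VarnishHit {
    // True when an X-Varnish header value indicates a cache hit,
    // i.e. it contains two whitespace-separated transaction IDs
    static boolean isCacheHit(String xVarnishHeader) {
        return xVarnishHeader.trim().split("\\s+").length == 2;
    }

    public static void main(String[] args) {
        System.out.println(isCacheHit("868897910"));            // first request: miss
        System.out.println(isCacheHit("868897911 868897910"));  // second request: hit
    }
}
```

Feeding it the two header values from the transcript above classifies the first request as a miss and the second as a hit.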

Varnish, with a few considerations related to explicit cache control instructions, provides a nice acceleration to web application performance.  As a from-the-start reverse proxy (vs. something like Squid, which started life as a forward proxy) it plays a valuable role in linked data approaches by allowing changes to systems and networks behind the scenes while URI persistence is maintained.  I've also had no issues with it with respect to 303 redirects on generic documents, also important in linked data approaches.

Monday, February 16, 2009

Neptune Database Services and Scripting:

The use of services on the net has an interesting set of issues to address with respect to acting as generic data resources.  Doing mashups with RSS or Atom feeds, weather, or other somewhat periodic and serial data has become rather common.  Yahoo Pipes (Ref: ) and other such web applications do a good job dealing with these types of data endpoints.

However, when working with scientific data sets and services like those at CHRONOS (Ref: ) a different set of use issues comes up.

Let's look at a basic XML-RPC styled service endpoint for the Neptune database at CHRONOS (Ref: ):

Here I have used a service from the XQE engine that has set the time range to 33-34 Ma,  fossil group to Diatoms, indicated that only dated items should be used and set the output serialization to web row sets (generic XML). 

However, let's say, by way of a recent example, we wanted to locate all the unique taxon names and drill holes for various time ranges and fossil types from the CHRONOS Neptune database.  We now need to loop over a set of time ranges and fossil names and then locate the unique, not full, set of names and locations.  We can do this with the service above, but not directly; we need to collect and process a range of calls on that service.

One could develop a small program in Java, Ruby, Groovy or some other language.  Dynamic languages like Groovy are especially well suited for such efforts, with a simple script like:
//Upper Oligocene   Sub-Series/Sub-Epoch    23.8    28.5    Oligocene
//Lower Oligocene   Sub-Series/Sub-Epoch    28.5    33.7    Oligocene
//Upper Eocene      Sub-Series/Sub-Epoch    33.7    36.9    Eocene
//Middle Eocene     Sub-Series/Sub-Epoch    36.9    49      Eocene
//Lower Eocene      Sub-Series/Sub-Epoch    49      55.05   Eocene
//Paleocene         Series/Epoch            55.05   65      Paleogene
import org.apache.commons.httpclient.HttpClient
import org.apache.commons.httpclient.methods.GetMethod

def ageRanges = ['23.8-28.5', '28.5-33.7', '33.7-36.9', '36.9-49', '49-55.05', '55.05-65']
def fossilTypes = ["Diatoms", "Planktonic+Foraminifera", "Radiolarians", "Calcareous+Nannoplankton"]

fossilTypes.each { type ->
    ageRanges.each { range ->
        def urlxml = ""  // XQE service URL elided in the original post
        def client = new HttpClient()
        def get = new GetMethod(urlxml)
        client.executeMethod(get)
        def xmlSlurped = null
        try {
            xmlSlurped = new XmlSlurper().parseText(get.getResponseBodyAsString())
            def entries = xmlSlurped.children()  // path into the web row set elided in the original
            def legSiteHole = []
            def bug = []
            entries.each { entry ->
                legSiteHole += entry  // leg/site/hole field access elided in the original
                bug += entry          // taxon name field access elided in the original
            }
            println "Total tax count is ${bug.size()} and total LSH count is ${legSiteHole.size()}"
            println "In range " + range + " of type " + type + " found " + bug.unique().size() +
                " unique taxa in " + legSiteHole.unique().size() + " unique leg/site/holes"
            println range + " " + bug.unique().size()
        } catch (org.xml.sax.SAXParseException e) {
            println "Exception parsing XML"
        }
    }
}
This script uses the Apache Commons HttpClient, but there are many ways to do this.  The point I am after is not to show a best practice in using Groovy to access services (I'm sure there are many aspects of that code people could comment on).  Rather, it is to highlight an issue related to balancing service generality and usefulness.

Note I did attempt to use both the XML and the character-separated-values versions via Yahoo Pipes and also the defunct Google Mashup Editor.  However, the need for a more programmatic interface to conduct loops, basic array manipulation and such is beyond what these can do easily.  While it might be possible, I did not find it easy or intuitive.  These resources seem more focused on RSS or Atom sources and simple filtering and counting.

Additionally, the Kepler workflow application is a bit of overkill and doesn't yet seem to work with REST or XML-RPC style services as well as I feel it should.  A heavy focus on WS-* and other more heavy-duty scientific data flow operations means its lightweight rapid service mashup capacity is limited (IMHO).

Making a resource or service of general "popular" use may make it of generic interest but can limit its utility on its own to address specific scientific efforts.

Expecting generic services to be of use via packages like Yahoo Pipes or Kepler runs into the implementation aspects of those tools, aspects which may make using them sufficiently complex as to discourage users.

However, expecting people to write code like the example above, even in a language like Groovy that makes calling, using and parsing the data rather easy is also potentially unrealistic.  Not everyone likes to write code.

So, there is a dilemma for providers of science data services like CHRONOS.  Make the services too generic and you risk them being of little use to focused research communities.  Address the vertical needs of a specific community or tool interface and you risk making them of limited utility to others.  Additionally, addressing the needs of several vertical communities could be taxing from a human resources point of view.

One does wonder if the creation of a domain specific language (DSL) might help address this.  If there was enough community interest to justify its creation, a service that consumes and processes a DSL would allow a sort of "custom command line interface".

The Sloan Digital Sky Survey SkyServer (Ref: ) addresses this by simply using SQL as that "DSL".  The page allows one to structure and submit SQL directly to the database.

However, a more focused DSL might be able to address the needs of specialized research groups while consuming more generic service endpoints, endpoints that could also be exposed for others to use via approaches similar to the code above, via the DSL, or via their own approach.

Also, the development effort of making such a DSL might be spread across multiple projects, making it more appealing from a human resources point of view.

Just some ramblings on how to balance generic services (resources) against the needs of special communities for more focused and unique services, regardless of whether they are REST, XML-RPC or WS-* in nature.

Other references:
Kepler Workflow Application:
Google Mashup Editor (defunct)
XPROC and
DSL in Groovy: