Showing posts with label interoperability.

Tuesday, July 10, 2012

Katy Börner's Plug and Play Macroscopes

Two Cytoscape engineers pointed me towards Plug and Play Macroscopes by Katy Börner. The paper envisions highly flexible and configurable software tools for science, built through the mechanism of a plugin architecture, and is well worth reading if you're involved in building scientific software.

Decision making in science, industry, and politics, as well as in daily life, requires that we make sense of data sets representing the structure and dynamics of complex systems. Analysis, navigation, and management of these continuously evolving data sets require a new kind of data-analysis and visualization tool we call a macroscope...


A macroscope is a modular software framework enabling end users - biologists, physicists, social scientists - to assemble customized tools reusing and recombining data sources, algorithms and visualizations. These tools consist of reconfigurable bundles of software plug-ins, using the same OSGi framework as the Eclipse IDE. Once a component has been packaged as a plug-in, it can be shared and combined with other components in new and creative ways to make sense of complexity, synthesizing related elements, finding patterns and detecting outliers. Börner sees these software tools as instruments on par with the microscope and the telescope.

CIShell

These concepts are implemented in a framework called Cyberinfrastructure Shell (CIShell), whose source is in the CIShell repo on GitHub. The core abstraction inside CIShell is that of an algorithm - something that can be executed, might throw exceptions, and returns some data. Data is some object that has a format and key/value metadata.

public interface Algorithm {
   public Data[] execute() throws AlgorithmExecutionException; 
}

public interface Data {
  public Dictionary getMetadata();
  public Object getData();
  public String getFormat();
}

Parenthetically, it's too bad there's no really universal abstraction for a function in Java... Callable, Runnable, Method, Function. In general, trying to wedge dynamic code into the highly static world of Java is not the most natural fit.


I'm guessing that integrating a tool involves wrapping its functionality in a set of algorithm implementations.

The framework also features what looks to be support for dynamic GUI construction, a database abstraction with a slot for a named schema and support for scripting in Python.

An upcoming version of Cytoscape is built on OSGi. Someone should write a genome browser along these same lines.

"To serve the needs of scientists the core architecture must empower non-programmers to plug, play, and share their algorithms and to design custom macroscopes and other tools. " In my experience, scientists who are capable of designing workflows in visual tools are not afraid of learning enough R or Python to accomplish much the same thing. I'm not saying it's obvious that they should do that. Just that the trade-offs are worth considering. The real benefit comes from raising the level of abstraction, rather than replacing command-line code with point-and-click GUIs.

Means of composition

Plugin architecture isn't the only way to compose independently developed software tools. My lab's Gaggle framework links software through messaging. Service-oriented architecture boils down to composition of web services, for example Taverna and myGrid. GenePattern and Galaxy both fit into this space, although I'm not sure I can do a good job of characterizing them. If I understand correctly, both seem to use common file formats, and conversions between them, as the intermediary between programs. The classic means of composition are Unix pipes and filters - small programs loosely joined - and scripting.

Visualization

In a keynote on Visual Design Principles at VIZBI in March of this year, Börner channeled inspiration from Yann Arthus-Bertrand's Home and Earth from Above, Drew Berry and Edward Tufte, and advised students of visualization to study human visual perception and cognitive processing first in order to "design for the human system".

Her workflow combines data-centric steps similar to processes outlined by Jeffrey Heer and Hadley Wickham with a detailed breakdown of the elements of visualization.


Katy Börner directs the Cyberinfrastructure for Network Science Center at Indiana University. Not content to have written the Atlas of Science (MIT Press, 2010), she is currently hanging out in Amsterdam writing an Atlas of Knowledge.


Saturday, June 23, 2012

Composition methods compared


Clojurist, technomancer, and Leiningen creator Phil Hagelberg does a nice job of dissecting "two ways to compose a number of small programs into a coherent system". Read the original post, in which the two methods are compared. These are my notes, quoted mostly verbatim:

The Unix way

Consists of many small programs which communicate by sending text over pipes or using the occasional signal. Around this compelling simplicity and universality has grown a rich ecosystem of text-based processes with a long history of well-understood conventions. Anyone can tie into it with programs written in any language. But it's not well-suited for everything: sometimes the requirement of keeping each part of the system in its own process is too high a price to pay, and sometimes circumstances require a richer communication channel than just a stream of text.
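
An aside of my own, not part of Phil's notes: even R can tie into that ecosystem through a pipe() connection, which lets an R script sit at the end of a shell pipeline. A throwaway illustration (the file name is made up):

# count the most frequent lines in a file with standard Unix tools,
# then pull the result into R
con <- pipe("sort genes.txt | uniq -c | sort -rn | head", open="r")
top.genes <- readLines(con)
close(con)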

The Emacs way

A small core written in a low-level language implements a higher-level language in which most of the rest of the program is implemented. Not only does the higher-level language ease the development of the trickier parts of the program, but it also makes it much easier to implement a good extension system since extensions are placed on even ground with the original program itself.

The core Mozilla platform is implemented mostly in a gnarly mash of C++, but applications like Firefox and Conkeror are primarily written in JavaScript, as are extensions.

Tuesday, May 08, 2012

Design Philosophies of Developer Tools

Clojure hacker Stuart Sierra wrote an insightful piece on the design philosophies of developer tools. His conclusions are paraphrased here:

Git is an elegantly designed system of many small programs, linked by the shared repository data structure.

Maven plugins, by contrast, are not designed to be composed and their APIs are underdocumented. With Maven, plugins fit into the common framework of the engine, which strikes me as perhaps a more difficult proposition than many small programs working on a common data structure.

Of the anarchic situation with Ruby package management (Gems, Bundler, and RVM) Sierra says, “layers of indirection make debugging harder,” and “The speed of development comes with its own cost.”

Principles

  • Plan for integration
  • Rigorously specify the boundaries and extension points of your system

...and I really like this idea:

  • The filesystem is the universal integration point
  • Fork/exec is the universal plugin architecture (a small sketch of these two follows below)
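
Here's a trivial sketch of what those last two principles can look like from R - my own generic example, not Sierra's, with a made-up external program standing in for the plugin:

# write intermediate data to the filesystem...
my.data <- data.frame(gene=c("VNG1179C", "VNG6210G"), value=c(1.2, -0.8))
write.csv(my.data, "input.csv", row.names=FALSE)

# ...then fork/exec an external tool on it ("cluster-tool" is hypothetical)
# and read its output back from the filesystem
system2("cluster-tool", args=c("--in", "input.csv", "--out", "output.csv"))
results <- read.csv("output.csv")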

Sunday, April 29, 2012

The Scalable Adapter Design Pattern for Interoperability

When wrestling with a gnarly problem, it's interesting to compare notes with others who've faced the same dilemma. Having worked on an interoperability framework, a system called Gaggle, I had a feeling of familiarity when I came across this well-thought-out paper:

The Scalable Adapter Design Pattern: Enabling Interoperability Between Educational Software Tools, Harrer, Pinkwart, McLaren, and Scheuer, IEEE Transactions on Learning Technologies, 2008

The paper describes a design pattern for getting heterogeneous software tools to interoperate with each other by exchanging data. Each application is augmented with an adapter that can interact with a shared representation. This hub-and-spoke model makes sense because it reduces the effort from writing n(n-1) adapters to connect every ordered pair of applications to writing one adapter per application - with ten applications, that's 90 adapters versus 10.


Scalability refers to the composite design pattern, which implements (what I would call) a more general concept: hierarchically structured data. If you've ever worked with large XML documents, calling them scalable might seem like an overstatement, but I see their point. XML nicely represents small objects, like events, as well as moderately sized data documents. The same can be said of JSON.

Applications remain decoupled from the data structure, with an adapter mediating between the two. The adapter also provides events when the shared data structure is updated. A nice effect of the hierarchical data structure is that client applications can register their interest in receiving events at different levels of the tree structure.
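
To make that concrete, here's a toy sketch of the idea - mine, not the paper's - in which handlers subscribe at some level of the hierarchy and receive events published anywhere beneath it:

subscribers <- list()

subscribe <- function(path, handler) {
  subscribers[[path]] <<- append(subscribers[[path]], list(handler))
}

publish <- function(path, event) {
  # notify handlers registered at this node and at every ancestor node
  parts <- strsplit(path, "/")[[1]]
  for (i in seq_along(parts)) {
    prefix <- paste(parts[1:i], collapse="/")
    for (handler in subscribers[[prefix]]) handler(event)
  }
}

subscribe("experiment", function(e) cat("whole-experiment listener:", e, "\n"))
subscribe("experiment/session1", function(e) cat("session listener:", e, "\n"))
publish("experiment/session1/action42", "new annotation added")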

The Scalable Adapter pattern combines well-established patterns - Adapter, Composite, and Publish-subscribe - yielding a flexible way for arbitrary application data to be exchanged at different levels of granularity.

The main difference between Scalable Adapter and Gaggle is that Gaggle focused on message passing rather than centrally stored data. The paper says, "it is critical that both the syntax and semantics of the data represented in the composite data structure...", but they don't really address what belongs in the "Basic Element" - the data to be shared. Gaggle solves this problem by explicitly defining a handful of universal data structures. Applications are free to implement their own data model, achieving interoperability by mapping (also in an adapter) their internal data structures onto the shared Gaggle data types.

The Scalable Adapter paper breaks the problem down systematically in terms of design patterns, while Gaggle was motivated by the software engineering strategies of separation of concerns and parsimony, plus the linguistic concept of semantic flexibility. It's remarkable that the two systems worked out quite similarly, given the different domains they were built for.

Friday, March 23, 2012

Applying Semantic Web Services to bioinformatics

Applying Semantic Web Services to bioinformatics: Experiences gained, lessons learnt
Phillip Lord, Sean Bechhofer, Mark D. Wilkinson, Gary Schiltz, Damian Gessler, Duncan Hull, Carole Goble, and Lincoln Stein
International Semantic Web Conference, Vol. 3298 (2004), pp. 350-364, doi:10.1007/b102467

Applying Semantic Web Services to bioinformatics is a 2004 paper on Semantic Web Services in the context of bioinformatics, based on the experiences of the myGrid, BioMOBY, and Semantic-MOBY projects. The important and worthy goal behind these projects is enabling composition and interoperability of heterogeneous software. Is Semantic Web technology the answer to data integration in biology? I'm a little skeptical.

Here's a biased selection of what the paper has to say:

  • "The importance of fully automated service discovery and composition is an open question. It is unclear whether it is either possible or desirable, for all services, in this domain..."
  • "Requiring service providers and consumers to re-structure their data in a new formalism for external integration is also inappropriate."
  • "Bioinformaticians are just not structuring their data in XML schema, because it provides little value to them."
  • "All three projects have accepted that much of the data that they receive will not be structured in a standard way. The obvious corollary of this is that without restructuring, the information will be largely opaque to the service layer."

A couple of interesting asides are addressed:

  • Most services or operations can be described in terms of inputs and outputs plus configuration parameters or secondary inputs. When building a pipeline, only the main inputs and outputs need be considered, leaving parameters for later.
  • A mixed user base divided between biologists and bioinformaticians is one difficulty noted in the paper. I've also found that tricky. Actually, the situation has changed since the article was written. Point-and-click biologists are getting to be an endangered species. The crop of biologists I see coming up these days is very computationally savvy. What I think of as the scripting-enabled biologist is a lot more common. Those not so enabled are increasingly likely to specialize in wet-lab work and do little or no data analysis.

In BioMOBY Successfully Integrates Distributed Heterogeneous Bioinformatics Web Services. The PlaNet Exemplar Case (2005), Wilkinson writes,

...interoperability in the domain of bioinformatics is, unexpectedly, largely a syntactic rather than a semantic problem. That is to say, interoperability between bioinformatics Web Services can be largely achieved simply by specifying the data structures being passed between the services (syntax) even without rich specification of what those data structures mean (semantics).

In The Life Sciences Semantic Web is Full of Creeps!, (2006) Wilkinson and co-author Benjamin M. Good write, "both sociological and technological barriers are acting to inhibit widespread adoption of SW technologies," and acknowledge the complexity and high curatorial burden.

The Semantic Web for the Life Sciences (SWLS), when realized, will dramatically improve our ability to conduct bioinformatics analyses... The ultimate goal of the SWLS is not to create many separate, non-interacting data warehouses (as we already can), but rather to create a single, ‘crawlable’ and ‘queriable’ web of biological data and knowledge... This vision is currently being delayed by the timid and partial creep of semantic technologies and standards into the resources provided by the life sciences community.

These days, Mark Wilkinson is working on SADI, which “defines an open set of best-practices and conventions, within the spectrum of existing standards, that allow for a high degree of semantic discoverability and interoperability”.

More on the Semantic Web

...looks like this old argument is still playing out.

Wednesday, December 14, 2011

SDCube and hybrid data storage

The Sorger lab at Harvard published a piece in the February 2011 Nature Methods that shows some really clear thinking on the topic of designing data storage for biological data. That paper, Adaptive informatics for multifactorial and high-content biological data, Millard et al., introduces a storage system called SDCubes, short for semantically typed data hypercubes, which boils down to HDF5 plus XML. The software is hosted at semanticbiology.com.


This two-part strategy applies HDF5 to store high-dimensional numeric data efficiently, while storing sufficient metadata in XML to reconstruct the design of the experiment. This is smart, because with modern data acquisition technology you're often dealing with volumes of data where attention to efficiency is required. But experimental design, the place where creativity directly intersects with science, is a rapidly moving target. The only hope of capturing that kind of data is a flexible semi-structured representation.

This approach is very reminiscent of, if a bit more sophisticated than, something that was tried in the Baliga lab called EMI-ML, which was roughly along the same lines except that the numeric data was stored in tab-separated text files rather than HDF5. Or to put it another way, TSV + XML instead of HDF5 + XML.

Another ISB effort, Addama (Adaptive Data Management), started as a ReSTful API over a content management system and has evolved into a ReSTful service layer providing authentication and search and enabling access to an underlying CMS, SQL databases, and data analysis services. Addama has ambitions beyond data management, but shares with SDCube the emphasis on adaptability to the unpredictable requirements inherent in research, by enabling software to reflect the design of individual experiments.

There's something to be said for these hybrid approaches. Once you start looking, you see a similar pattern in lots of places.

  • NoSQL - SQL hybrids
  • SQL - XML hybrids. SQL-Server, for one, has great support for XML enabling XQuery and XPath mixed with SQL.
  • Search engines, like Solr, are typically used next to an existing database
  • Key/value tables in a database (aka Entity–attribute–value)

Combining structured and semi-structured data allows room for flexibility where you need it, while retaining RDBMS performance where it fits. Using HDF5 adapts the pattern for scientific applications working with vectors and matrices, structures not well served by either relational or hierarchical models. Where does that leave us in biology with our networks? I don't know whether HDF5 can store graphs. Maybe we need a triple hybrid relational-matrix-graph database.


By the way, HDF5 libraries exist for several languages. SDCube is in Java. MATLAB can read HDF5. There is an HDF5 package for R, but it seems incomplete.
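
Just to make the hybrid pattern concrete, here's roughly what it looks like from R using Bioconductor's rhdf5 together with the XML package - a toy example of my own, not the SDCube API:

library(rhdf5)
library(XML)

# the bulky numeric data goes into HDF5...
measurements <- matrix(rnorm(96 * 10), nrow=96, ncol=10)
h5createFile("experiment.h5")
h5createGroup("experiment.h5", "plate1")
h5write(measurements, "experiment.h5", "plate1/readings")

# ...while the free-form experimental design goes into XML
design <- newXMLNode("experiment", attrs=c(id="exp-001"))
newXMLNode("treatment", "EGF 100 ng/ml", parent=design)
newXMLNode("timepoints", "0, 5, 15, 30, 60 min", parent=design)
saveXML(design, file="experiment.xml")

# reading the numeric block back is a single call
readings <- h5read("experiment.h5", "plate1/readings")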

Relational databases work extremely well for some things, but flexibility has never been their strong point. They've been optimized for 40 years, but they're shaped by the transaction processing problems they were designed to solve, and they get awkward for certain uses - graphs, matrices, and frequently changing schemas, to name a few.

Maybe, before too long, the big database vendors will go multi-paradigm and we'll see matrices, graphs, key-value pairs, XML, and JSON as native data structures that can be sliced and diced and joined at will right along with tables.

More information

  • Quantitative data: learning to share, an essay by Monya Baker in Nature Methods, describes how "Adaptive technologies are helping researchers combine and organize experimental results."

Monday, September 27, 2010

How to send an HTTP PUT request from R


I wanted to get R talking to CouchDB. CouchDB is a NoSQL database that stores JSON documents and exposes a ReSTful API over HTTP. So, I needed to issue the basic HTTP requests: GET, POST, PUT, and DELETE from within R. Specifically, to get started, I wanted to add documents to the database using PUT.

There's a CRAN package called httpRequest, which I thought would do the trick. This wound up being a dead end. There's a better way. Skip to the RCurl section unless you want to snicker at my hapless flailing.

Stuff that's totally beside the point

As Edison once said, "Failures? Not at all. We've learned several thousand things that won't work."

The httpRequest package is very incomplete, which is fair enough for a package at version 0.0.8. It implements only basic GET, POST, and multipart POST. Both POST methods seem to expect name/value pairs in the body of the request, whereas accessing web services typically requires XML or JSON in the request body. And, if I'm interpreting the HTTP spec right, these methods mishandle termination of response bodies.

Given this shaky foundation to start with, I implemented my own PUT function. While I eventually got it working for my specific purpose, I don't recommend going that route. HTTP, especially 1.1, is a complex protocol and implementing it is tricky. As I said, I believe the httpRequest methods, which send HTTP/1.1 in their request headers, get it wrong.

Specifically, they read the HTTP response with a loop like one of the following:

repeat{
  ss <- read.socket(fp,loop=FALSE)
  output <- paste(output,ss,sep="")
  if(regexpr("\r\n0\r\n\r\n",ss)>-1) break()
  if (ss == "") break()
}
repeat{
 ss <- rawToChar(readBin(scon, "raw", 2048))
 output <- paste(output,ss,sep="")
 if(regexpr("\r\n0\r\n\r\n",ss)>-1) break()
 if(ss == "") break()
 #if(proc.time()[3] > start+timeout) break()
}

Notice that they're counting on a blank line, a zero followed by a blank line, or the server closing the connection to signal the end of the response body. The zero presumably comes from the terminating chunk of a chunked (Transfer-Encoding: chunked) response, but there's no reason to count on it not being broken up during reading. Looking through RFC 2616 we find this description of an HTTP message:

generic-message = start-line
                  *(message-header CRLF)
                  CRLF
                  [ message-body ]

While the headers section ends with a blank line, the message body is not required to end in anything in particular. The part of the spec that refers to message length lists 5 ways that a message may be terminated, 4 of which are not "server closes connection". None of them are "a blank line". HTTP 1.1 was specifically designed this way so web browsers could download a page and all its images using the same open connection.

For my PUT implementation, I fell back to HTTP 1.0, where I could at least count on the connection closing at the end of the response. Even then, socket operations in R are confusing, at least for the clueless newbie such as myself.

One set of socket operations consists of: make.socket, read.socket/write.socket and close.socket. Of these functions, the R Data Import/Export guide states, "For new projects it is suggested that socket connections are used instead."

OK, socket connections, then. Now we're looking at: socketConnection, readLines, and writeLines. Actually, tons of IO methods in R can accept connections: readBin/writeBin, readChar/writeChar, cat, scan and the read.table methods among others.

At one point, I was trying to use the Content-Length header to properly determine the length of the response body. I would read the header lines using readLines, parse those to find Content-Length, then I tried reading the response body with readChar. From the name, I got the impression that readChar was like readLines but one character at a time. According to some helpful tips I got on the r-help mailing list, this is not the case. Apparently, readChar is for binary mode connections, which seems odd to me. I didn't chase this down any further, so I still don't know how you would properly use Content-Length with the R socket functions.

Falling back to HTTP 1.0, we can just call readLines 'til the server closes the connection. In an amazing, but not recommended, feat of beating a dead horse until you actually get somewhere, I finally came up with the following code, with a couple variations commented out:

http.put <- function(host, path, data.to.send, content.type="application/json", port=80, verbose=FALSE) {

  if(missing(path))
    path <- "/"
  if(missing(host))
    stop("No host URL provided")
  if(missing(data.to.send))
    stop("No data to send provided")

  content.length <- nchar(data.to.send)

  header <- NULL
  header <- c(header,paste("PUT ", path, " HTTP/1.0\r\n", sep=""))
  header <- c(header,"Accept: */*\r\n")
  header <- c(header,paste("Content-Length: ", content.length, "\r\n", sep=""))
  header <- c(header,paste("Content-Type: ", content.type, "\r\n", sep=""))
  request <- paste(c(header, "\r\n", data.to.send), sep="", collapse="")

  if (verbose) {
    cat("Sending HTTP PUT request to ", host, ":", port, "\n")
    cat(request, "\n")
  }

  con <- socketConnection(host=host, port=port, open="w+", blocking=TRUE, encoding="UTF-8")
  on.exit(close(con))

  writeLines(request, con)

  response <- list()

  # read whole HTTP response and parse afterwards
  # lines <- readLines(con)
  # write(lines, stderr())
  # flush(stderr())
  # 
  # # parse response and construct a response 'object'
  # response$status = lines[1]
  # first.blank.line = which(lines=="")[1]
  # if (!is.na(first.blank.line)) {
  #   header.kvs = strsplit(lines[2:(first.blank.line-1)], ":\\s*")
  #   response$headers <- sapply(header.kvs, function(x) x[2])
  #   names(response$headers) <- sapply(header.kvs, function(x) x[1])
  # }
  # response$body = paste(lines[(first.blank.line+1):length(lines)], collapse="\n")

  response$status <- readLines(con, n=1)
  if (verbose) {
    write(response$status, stderr())
    flush(stderr())
  }
  response$headers <- character(0)
  repeat{
    ss <- readLines(con, n=1)
    if (verbose) {
      write(ss, stderr())
      flush(stderr())
    }
    if (ss == "") break
    key.value <- strsplit(ss, ":\\s*")
    response$headers[key.value[[1]][1]] <- key.value[[1]][2]
  }
  response$body = readLines(con)
  if (verbose) {
    write(response$body, stderr())
    flush(stderr())
  }

  # doesn't work. something to do with encoding?
  # readChar is for binary connections??
  # if (any(names(response$headers)=='Content-Length')) {
  #   content.length <- as.integer(response$headers['Content-Length'])
  #   response$body <- readChar(con, nchars=content.length)
  # }

  return(response)
}

After all that suffering, which was undoubtedly good for my character, I found an easier way.

RCurl

Duncan Temple Lang's RCurl is an R wrapper for libcurl, which provides robust support for HTTP 1.1. The paper R as a Web Client - the RCurl package lays out a strong case that wrapping an existing C library is a better way to get good HTTP support into R. RCurl works well and seems capable of everything needed to communicate with web services of all kinds. The API, mostly inherited from libcurl, is dense and a little confusing. Even given the docs and paper for RCurl and the docs for libcurl, I don't think I would have figured out PUT.
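
For the record, a raw RCurl PUT looks something like this (my sketch after the fact; the option names come straight from libcurl, and the CouchDB URL is just an example):

library(RCurl)

reader <- basicTextGatherer()
curlPerform(url="http://localhost:5984/testdb/doc1",
            customrequest="PUT",
            httpheader=c("Content-Type"="application/json"),
            postfields='{"name": "value"}',
            writefunction=reader$update)

# the response body (JSON from CouchDB) accumulates in the gatherer
reader$value()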

Luckily, at that point I found R4CouchDB, an R package built on RCurl and RJSONIO. R4CouchDB is part of a Google Summer of Code effort, NoSQL interface for R, through which high-level APIs were developed for several NoSQL DBs. Finally, I had stumbled across the answer to my problem.

I'm mainly documenting my misadventures here. In the next installment, CouchDB and R, we'll see what actually worked. In the meantime, is there a conclusion to be drawn from all this fumbling?

My point if I have one

HTTP is so universal that a high quality implementation should be a given for any language. HTTP-based APIs are being used by databases, message queues, and cloud computing services. And let's not forget plain old-fashioned web services. Mining and analyzing these data sources is something lots of people are going to want to do in R.

Others have stumbled over similar issues. There are threads on r-help about hanging socket reads, R with CouchDB, and getting R to talk over Stomp.

RCurl gets us pretty close. It could use high-level methods for PUT and DELETE and a high-level POST amenable to web-service use cases. More importantly, this stuff needs to be easier to find without sending the clueless noob running down blind alleys. RCurl is greatly superior to httpRequest, but that's not obvious without trying it or looking at the source. At minimum, it would be great to add a section on HTTP and web services with RCurl to the R Data Import/Export guide. And finally, take it from the fool: trying to roll your own HTTP (1.1 especially) is a fool's errand.