
Monday, 18 July 2011

HTTP or it doesn't exist.

This is my 'or it doesn't exist' blog post[1][2][3]. I think everyone should have one ;-)

A big chunk of my life is processing electronic information. Since I would like it to be a (slightly) smaller chunk of my life, I want to automate as much as possible. Now ideally, I don't want a massive disconnect between what I have to do as a human processor of information and what I need to tell a computer to do to get that job done without my help. Because it's easier that way.

So when I hear that the information I need to process is in some spreadsheet or other on a Windows share, it makes me a little sad. When I hear that it is available via a sensible REST interface in a sensible format, my heart leaps for joy just a little.

With something like Python's standard library (and third-party package) support for HTTP (requests), XML (ElementTree) and JSON, I should be able to get my computer to do most of the manual data processing tasks which involve 'documents' of some form or other.
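For instance (a sketch only - the URLs here are invented), pulling apart JSON and XML resources over HTTP takes just a few lines:

import json
import xml.etree.ElementTree as ET

import requests

# JSON resource: straight into native Python data structures.
resp = requests.get('http://example.com/api/items.json')
items = json.loads(resp.text)

# XML resource: parse the body and walk the tree.
resp = requests.get('http://example.com/api/items.xml')
root = ET.fromstring(resp.content)
for item in root.findall('item'):
    print('%s %s' % (item.get('id'), item.text))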

In a previous job I worked at convincing anyone who would listen that 'XML over HTTP' was the best thing since sliced bread. With appropriate XSLT and CSS links, the same data source (i.e. URI) could be happily consumed by both man and machine. Admittedly most of the information was highly structured data - wire protocols and the like - but it still needed to be understandable by real people and real programs.

I'm not an XML expert, but I think I 'get' it. I never understood why it needed so much baggage though, and can't say I'm sad that the whole web services thing seems to be quietly drifting into the background - though maybe it was always trying to.

A lot changes in web technology in a short time, and XML is no longer 'cool', so I won't be quite as passionate about 'XML over HTTP' as I once was. For short fragments it is far more verbose than JSON, though I'd argue that for longer documents, XML's added expressiveness makes the verbosity worth it.

Maybe it was ever thus, but whenever two technologies have even the slightest overlap, there seems to be a territorial defensiveness which makes the thought of using both in one project seem somewhat radical. So while I've used JSON much more than XML in the last couple of years, I've not turned against it. If done right (Apple, what were you thinking with plist files!?) it is great. Compared to JSON-like representations, the ability to have attributes on every node in the tree is a whole new dimension in making a data source more awesome and usable (or terribly broken and rubbish). I've seen too many XML documents where either everything is an attribute or nothing is, but it's not exactly rocket science.

Things I liked about XML:

Simplicity
I like to think I could write a parser for XML 1.0 without too much effort. If it's not well formed, stop. Except for trivial whitespace normalisation etc., there is a one-to-one mapping of structure to serialisation. Compare that with the mess of HTML parsers. While HTML5 might now specify how errored documents should be parsed (i.e. what the resulting DOM should be), I suspect that an HTML5-to-DOM parser is a far more complex beast.
Names! Sensible Names!
Because HTML is limited in its domain, it has a fixed (though growing, thanks to the living standard[4] which is HTML) set of tags. When another domain is imposed on top of that, the class attribute tends to get pressed into service in an ugly and overloaded way. By allowing top-level tags to be domain-specific, we can make the document abstraction more 'square'[5].
Attributes
Attributes allow metadata to be attached to document nodes. Just as a lower-level language is fully capable of creating a solution to any given problem, having 'zero mental cost' abstractions (such as the data structures provided by high-level languages) enables new ways of thinking about problems. In the same way, having attributes on data nodes doesn't give us anything we couldn't implement without them, but it provides another abstraction which I've found invaluable and missed when using or creating JSON data sources.
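To make that concrete (a sketch, with an invented document), compare where the metadata sits in XML and in a near-equivalent JSON representation:

import json
import xml.etree.ElementTree as ET

# In XML, metadata hangs off the node as attributes,
# kept distinct from the node's content.
node = ET.fromstring('<reading id="t1" units="celsius">21.5</reading>')
print('%s %s' % (node.get('units'), node.text))  # celsius 21.5

# The nearest JSON shape promotes the metadata to siblings
# of the content, and the distinction is lost.
reading = json.loads('{"id": "t1", "units": "celsius", "value": 21.5}')
print('%s %s' % (reading['units'], reading['value']))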

What does make me slightly(!) sad though is the practical demise of XHTML and any priority that browsers might give to processing XML. There is now a many-to-one mapping of markup to DOM, and pre-HTML5 (and still in practice for the foreseeable future, considering browser idiosyncrasies and bugs) a many-to-many mapping. It wouldn't surprise me if XSLT transform support eventually disappeared from browsers.

Maybe there's a bit of elitism here - if you can't code well-formed markup and some decent XSLT (preferably with lots of convoluted functional programming thrown in) - then frankly 'get off my lawn!'. I love the new features in HTML(5), but part of me wishes that there was an implied background 'X' unquestionably preceding that, for all things. The success of the web is that it broke out of that mould. But in doing so it has compromised the formalisms which machines demand and require. Is the dream of the machine-readable semantic web getting further away - even as cool and accessible (and standards-compliant - at last) web content finally looks like it might start to achieve its goal? Is it too much (and too late) to dream of 'data' (rather than ramblings like this one) being available in the same form for both the human viewer and the computer automaton?

I'm prepared to be realistic and accept where we've come to. It's not all bad, and technology has never changed faster. It's an exciting time to wield electronic information, and we've got the tools to move forward from inaccessible files stored on closed, disconnected systems. So where I used to say 'XML over HTTP', my new mantra shall be 'HTTP or it doesn't exist'. At least for a while.

Friday, 31 December 2010

HTTP 'streaming' from Python generators

One of the more annoying things about HTTP is that it wants to send things in complete chunks: you ask for an object at a particular URL, and some time later you get that object. There isn't a lot you can do (at least from JavaScript) until the complete resource has loaded, which makes fine-grained control awkward.

Of course web sockets will solve all of this (maybe) once the spec has gone through the mill a few more times. But in the meantime there is often an impedance mismatch between how we'd like to do things on the server side and how we are forced to do them because of the way HTTP works.

The following is an example of one way to manage splitting things up. It allows Python generators to be used on the server side, with an update sent to the client on every yield and the client long-polling to fetch each one. This shouldn't be confused with CherryPy's support for using yield to stream a single response back to the client (which is discouraged) - here the yield functionality is hijacked for other purposes when the method decorator is applied. Also note that this is only of any help for clients which can use AJAX to repeatedly poll the generator.

Example

Let's suppose that we want to generate a large (or infinite) volume of data and send it to a web client. It could be a long text document served line-by-line. But let's use the sequence of prime numbers (because that's good enough for aliens). We want to send it to the client, and have it processed as it arrives. The principle is to use a generator on the server side rather than a basic request function, but wrap that in something which translates the generator into a sequence of responses, each serving one chunk of the response.

Server implementation using CherryPy - note the json_yield decorator.

import cherrypy

class PrimeGen(object):
    @cherrypy.expose
    def index(self):
        return INDEX_HTML # see below

    @cherrypy.expose
    @json_yield   # see below
    def prime(self):
        # this isn't supposed to be efficient.
        yield 2  # the only even prime
        probe = 3
        while True:
            # trial division by every smaller number
            for i in range(2, probe):
                if probe % i == 0:
                    break
            else:
                # no divisor found: probe is prime
                yield probe
            probe += 2

cherrypy.quickstart(PrimeGen)

The thing which turns this generator into something usable with long-polling is the following 'json_yield' decorator.

Because we might want more than one such generator on the server (not to mention generator instances from multiple clients), we need a key - passed in from the client - which associates a particular client with the generator instance. This isn't really handled in this example, see the source file download at the end of the post for that.

The major win is that the client doesn't have to store a 'next-index' or anything else. State is stored implicitly in the Python generator on the server side, so both client and server code should be simpler. Of course this goes against REST principles, where one of the fundamental tenets is that session state should live on the client. But there is a place for everything.

import functools
import json

def json_yield(fn):
    # each application of this decorator has its own id
    json_yield._fn_id += 1

    # put it into the local scope so our internal function
    # can use it properly
    fn_id = json_yield._fn_id

    @functools.wraps(fn)
    def _(self, key, *o, **k):
        """
        key should be unique to a session.
        Multiple overlapping calls with the same
        key should not happen (will result in
        ValueError: generator already executing)
        """

        # create the generator if it hasn't been created already
        if (fn_id, key) not in json_yield._gen_dict:
            new_gen = fn(self, *o, **k)
            json_yield._gen_dict[(fn_id, key)] = new_gen

        # get the next result from the generator
        try:
            # get, assuming there is more.
            gen = json_yield._gen_dict[(fn_id, key)]
            content = next(gen)
            # send it
            return json.dumps({'state': 'ready',
                               'content': content})
        except StopIteration:
            # remove the generator object
            del json_yield._gen_dict[(fn_id, key)]
            # signal we are finished.
            return json.dumps({'state': 'done',
                               'content': None})
    return _
# some function data...
json_yield._gen_dict = {}
json_yield._fn_id = 0
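
Putting the two together, each client poll gets back a single JSON object. For the prime generator above, successive requests to /prime?key=1 return:

{"state": "ready", "content": 2}
{"state": "ready", "content": 3}
{"state": "ready", "content": 5}

and so on; a finite generator would eventually answer with {"state": "done", "content": null}.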

The HTML to go with this does basic long-polling, separating the state from the content. Here I'm using jQuery:

INDEX_HTML = """
<html>
<head>
<script src="http://code.jquery.com/jquery-1.4.4.min.js">
</script>
<script>
$(function() {

  function update() {
    $.getJSON('/prime?key=1', {}, function(data) {
      if (data.state != 'done') {
        $('#status').text(data.content);
        //// alternative for appending:
        // $('#status').append($('<p>' + data.content + '</p>'));
        setTimeout(update, 0);
      }
    });
  }
  update();
});
</script>
</head>
<body>
<div id="status"></div>
</body>
</html>
"""

Uses

The example above is contrived, but there are plenty of real uses. Long-running server-side processes can send status information back to the client with minimal hassle. Client-side processing of large text documents can begin before they have finished downloading. One caveat is that each chunk of the data needs to be semantically meaningful on its own: applying this to binary files or to XML (which is only valid once the root element is closed) won't give sensible results.
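
As a sketch of the status-update case (reusing the json_yield decorator above, with time.sleep standing in for real work):

import time

import cherrypy

class Worker(object):
    @cherrypy.expose
    @json_yield  # as defined above
    def status(self):
        # report progress after each unit of work
        for done in range(1, 101):
            time.sleep(0.1)  # stand-in for a unit of real work
            yield '%d%% complete' % done

Each poll then returns the next progress message until the generator is exhausted, at which point the client receives the 'done' state.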

There are also plenty of possibilities in extending this: the decorator could accumulate the content (rather than leaving that to the client) and send everything-up-to-now back on each poll, or (given finite memory) some portion of it using a length-limited deque. Additional metadata (e.g. a count of messages so far, or the session key) could be added to the JSON sent to the client on each poll.
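
A rough sketch of the deque idea (names invented, and assuming the json_yield internals above): keep a bounded history per generator and return everything still retained on each poll.

from collections import deque

# one bounded buffer of recent chunks per (fn_id, key) pair
history = {}

def remember(fn_id, key, content, maxlen=100):
    buf = history.setdefault((fn_id, key), deque(maxlen=maxlen))
    buf.append(content)
    return list(buf)  # all chunks still retained, oldest first

Inside the decorator, the 'ready' branch would then return remember(fn_id, key, content) as the content field instead of the bare chunk.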

Disclaimer: I've not done any research into how others do this, because coding is more fun than research. There are undoubtedly better ways of accomplishing similar goals. In particular there are issues with memory usage and timeouts which aren't handled with json_yield. Also note that the example is obviously silly - it would be much faster to compute the primes on the client.

Download Files