Monday, 23 November 2015
Recently, while writing an internal protocol, I found myself prospecting for data formats. I'd usually opt for JSON as a great all rounder, being reasonably concise, human-readable, and widely supported, but this time I could afford to dig around for alternatives. After all, JSON makes passable config files, but I'd rather use TOML (which I also wrote a Scheme module for) given the right environment. The same can be true with data, and one of the alternative formats I uncovered, Bencode (pronounced B-Encode), turns out to be pretty neat.
Bencoding is used to format structured data in torrent files. It uses ASCII characters as delimiters and digits, making it easy to debug, and being unaffected by endianness it's easy to support cross-platform. Unlike the more space efficient Protocol Buffers, Bencode does not require a schema definition to parse, a restriction that makes incremental updates painful, it includes all the relevant keys in the message itself. Bencode can safely include binary data too, and as you'll see, is incredibly easy to serialize and parse.
In Bencode, strings are almost identical to Netstrings, a delightfully simple format for bytestrings, where the value is prefixed by it's length in bytes:
12:hello world! => "hello world!"
This means no character escaping is necessary (eg, NULL characters), and when parsing, the memory required is known in advance. In addition to bytestrings, Bencode provides integers, lists and dictionaries.
Integers use ASCII digits delimited by i and e:
i123e => 123
Lists are just multiple bencoded values surrounded by l and e:
l5:jelly4:cake7:custarde => ["jelly", "cake", "custard"]
Similarly, dictionaries use d and e to wrap a sequence of keys and values:
d4:name5:cream5:pricei100ee => {"name": "cream", "price": 100}
Dictionary keys must be sorted in lexicographical order, meaning there is only one valid bencoding for a given (possibly complex) value. So bencoded values can be compared or hashed directly without the need to decode.
The simplicity of this system means it's easy to produce correct bencoded messages with minimal effort, even using printf() and friends, a very handy property in embedded environments. It also means a new parser can be added quickly when required, which is exactly what I did for CHICKEN Scheme, implementing a bencode module in very little time:
(use bencode) (bencode->string '(123 456)) ;; => "li123ei456ee" (string->bencode "12:hello world!") ;; => "hello world!"
One advantage to Bencode's simple rules, apart from being easy to implement, is that it's easy to process. Parsing and serializing bencoded data is generally faster than the alternatives I've tried, not because of my great parser code, just as a result of a well considered encoding.
I made a quick test, decoding and then encoding some example data 100,000 times.
This is not chosen to reflect any common usage or benefit any particular parser, I'm just bashing on the keyboard here…
{
"foo": 123,
"bar": {
"baz": [1, 2, 3, 4, "hello"]
}
"qux": "This is a much longer string containing\nnew\nlines\netc.",
"wibble": [123123, "foo", "bar", "baz", 123124, [1 "a"], ["b" 4], ["cd"]]
}
bencode 171 bytes json 176 bytes tagged netstring 205 bytes msgpack 129 bytes
bencode 2.274s json 16.886s tagged netstring 4.986s msgpack 7.98s
bencode 2.513s json 5.444s tagged netstring 4.903s msgpack 4.254s
So, that's Bencode, a neat little format I hadn't noticed before. Other JSON alternatives you might consider are: MessagePack, a more complex binary format with better size efficiency though not human-readable, and Tagged Netstrings a backwards compatible extension to Netstrings developed by Zed Shaw as part of the Mongrel2 webserver project. Tagged Netstrings support more data types than Bencode but may not work well for streams since type information is located at the end of each value, not the start. Of course, most of the time I'll still be using JSON, and there's nothing wrong with that!