
Monday, 7 May 2012

urlsearch - web searching from the command line

Continuing my pursuit of having a custom vocabulary on my command line, preferably via argv[0] abuse, I've now addressed the subject of web searching. I probably make somewhere between 20 and 50 searches in a typical day, mostly on Google, but Wikipedia comes high up on the list too.

urlsearch is a small script which kicks off a browser search from the command line. The plan is that the task switch associated with moving from the command line (where you usually are, right?) to the browser is eliminated. By complex, spurious and - to be frank - non-existent calculations, I estimate that this reduction in friction should make you 4.6% more productive, and thus make the world a better place.

Using Python's webbrowser module, it's straightforward to open a browser at a particular page:

>>> import webbrowser
>>> webbrowser.open('http://google.com/search?q=standard+library+pep8')

urlsearch gives you the equivalent of the above from the following at a Bash prompt:

$ google standard library pep8

It's simple, short, and in its basic form is just a couple of lines of Python:

#!/usr/bin/env python
import sys, urllib, webbrowser

# Join the command-line arguments into a single query, URL-encode it,
# and open the default browser on the resulting Google search.
webbrowser.open('http://google.com/search?q=' +
                urllib.quote_plus(' '.join(sys.argv[1:])))

Make that executable, put it in the path, and you're good to go with google [1] searching from the command line. However, as always, complexity is lurking, and desires to have its way...

The following things are addressed in urlsearch:

Automatic gTLD checking

A range of gTLDs are searched in turn using socket.getaddrinfo(HOSTNAME, 'http'). By default this list starts with the empty gTLD (so local search domains are tried first), then .com, .org, .net, and .co.uk are tried in that order - these being most relevant to my uses. Changing the default to try '.fr' first might be reasonable for French speakers, for example, but having a sensible default list makes this one more thing not to have to think about. As it were.
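
Something along these lines would do it (a sketch only - the suffix list and fallback here are illustrative rather than the exact urlsearch code):

import socket

# Try the bare hostname first (which picks up local search domains), then
# fall back through a list of common suffixes until one resolves.
SUFFIXES = ['', '.com', '.org', '.net', '.co.uk']

def resolve_host(name):
    for suffix in SUFFIXES:
        candidate = name + suffix
        try:
            socket.getaddrinfo(candidate, 'http')
            return candidate
        except socket.gaierror:
            continue
    return name + '.com'  # nothing resolved; assume .com and hope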

Generic search engine support

This is where the wonder of argv[0] fits in :-) Via symlinks to urlsearch, various search engines can be supported. An argv[0] of 'google' will cause a google.com search, while 'wiki' is special-cased to Wikipedia. The search query format also needs special-casing for many search engines, though the default of /search?q={terms} works for Google, Bing and several other sites.

The following sites are directly supported or special cased:

argv[0]       search engine
google        Google
bing          Bing
wiki          Wikipedia
duckduckgo    DuckDuckGo
pylib         Python standard libraries (direct jump)
jquery        jQuery API search (direct jump)

These are managed in the code by a very dull if/elif chain, though something a bit less 'hackish' would probably be needed to scale to further engines.
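
To give a flavour of the idea (the hosts and query paths below are illustrative examples, not lifted from urlsearch itself):

import os
import sys

# The name this script was invoked as - i.e. which symlink was used -
# selects the search engine.
command = os.path.basename(sys.argv[0])

if command == 'wiki':
    host, path = 'en.wikipedia.org', '/wiki/Special:Search?search={terms}'
elif command == 'duckduckgo':
    host, path = 'duckduckgo.com', '/?q={terms}'
else:
    # the default pattern works for Google, Bing and several others
    host, path = command + '.com', '/search?q={terms}'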

Trac support

Trac [2] follows the same search query format as Google, and has a great 'quickjump' feature, where certain search query formats take the user directly to the relevant page. For example, a search for r5678 will go directly to the changeset for revision 5678, and a search for #1234 will go directly to ticket 1234. This ticket search can't be typed directly at a Bash prompt, however, as the '#' starts a comment and the rest of the line is ignored. This is special-cased such that if the search term is an integer, it is prefixed with '#'.
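
In other words, something like this (a sketch of the special case, not the verbatim code):

import sys

# A purely numeric query is taken to be a Trac ticket number, so it gets
# the '#' prefix that Bash would otherwise have swallowed as a comment.
query = ' '.join(sys.argv[1:])
if query.isdigit():
    query = '#' + query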

Other tweaks

Output that the browser writes to the terminal (as happens with Chrome, for example) is redirected to /dev/null.
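
One way to do this (a sketch - I'm assuming the redirection happens in the script itself) is to point the script's stdout and stderr at /dev/null before the browser is launched, since the spawned browser inherits those file descriptors:

import os
import sys

# Anything the browser later writes to stdout/stderr now goes to /dev/null
# instead of cluttering the terminal.
devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(devnull, sys.stdout.fileno())
os.dup2(devnull, sys.stderr.fileno())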

The Code.

The code is here: http://bitbucket.org/codedstructure/urlsearch

[1] Other search vendors are available
[2] http://trac.edgewall.org

Monday, 18 July 2011

HTTP or it doesn't exist.

This is my 'or it doesn't exist' blog post[1][2][3]. I think everyone should have one ;-)

A big chunk of my life is processing electronic information. Since I would like it to be a (slightly) smaller chunk of my life, I want to automate as much as possible. Now ideally, I don't want a massive disconnect between what I have to do as a human processor of information and what I need to tell a computer to get that job done without my help. Because it's easier that way.

So when I hear that the information I need to process is in some spreadsheet or other on a Windows share, it makes me a little sad. When I hear that it is available via a sensible REST interface in a sensible format, my heart leaps for joy just a little.

With something like Python's standard library (and third-party package) support for HTTP (requests), XML (ElementTree) and JSON, I should be able to get my computer to do most of the manual data processing tasks which involve 'documents' of some form or other.
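
For example (the URL and field names here are invented purely for illustration), pulling values out of JSON or XML fetched over HTTP takes only a few lines:

import json
import xml.etree.ElementTree as ET

import requests  # third-party: pip install requests

# Hypothetical endpoints - the point is how little code the 'document'
# handling needs once the data is exposed over HTTP.
resp = requests.get('http://example.com/api/items.json')
titles = [item['title'] for item in json.loads(resp.text)]

resp = requests.get('http://example.com/api/items.xml')
root = ET.fromstring(resp.content)
names = [node.get('name') for node in root.findall('item')]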

In a previous job I worked at convincing anyone who would listen that 'XML over HTTP' was the best thing since sliced bread. With appropriate XSLT and CSS links, the same data source (i.e. URI) could be happily consumed by both man and machine. Admittedly most of the information was highly structured data - wire protocols and the like, but it still needed to be understandable by real people and real programs.

I'm not an XML expert, but I think I 'get' it. I never understood why it needed so much baggage though, and can't say I'm sad that the whole web services thing seems to be quietly drifting into the background - though maybe it was always trying to.

A lot changes in web technology in a short time, and XML is no longer 'cool', so I won't be quite as passionate about 'XML over HTTP' as I once was. For short fragments it is far more verbose than JSON, though I'd argue that for longer documents, XML's added expressiveness makes the verbosity worth it.

Maybe it was ever thus, but whenever two technologies have even the slightest overlap, there seems to be a territorial defensiveness which makes the thought of using both in one project seem somewhat radical. So while I've used JSON much more than XML in the last couple of years, I've not turned against it. If done right (Apple, what were you thinking with plist files!?) - it is great.

Compared to JSON-like representations, the ability to have attributes for every node in the tree is a whole new dimension in making a data source more awesome or usable (or terribly broken and rubbish). I've seen too many XML documents where either everything is an attribute or nothing is, but it's not exactly rocket science.

Things I liked about XML:

Simplicity
I like to think I could write a parser for XML 1.0 without too much effort. If it's not well formed, stop. Except for trivial whitespace normalisation etc., there is a one-to-one mapping of structure to serialisation. Compare that with the mess of HTML parsers. While HTML5 might now specify how errored documents should be parsed (i.e. what the resulting DOM should be), I suspect that an HTML5 -> DOM parser is a far more complex beast.
Names! Sensible Names!
Because HTML is limited in its domain, it has a fixed (though growing, thanks to the living standard[4] which is HTML) set of tags. When another domain is imposed on top of that, the class attribute tends to get pressed into service in an ugly and overloaded way. By allowing top-level tags to be domain-specific, we can make the document abstraction more 'square'[5].
Attributes
Attributes allow metadata to be attached to document nodes. Even though a lower-level language is fully capable of expressing a solution to any given problem, 'zero mental cost' abstractions (such as the data structures provided by high-level languages) enable new ways of thinking about problems. In the same way, having attributes on data nodes doesn't give us anything we couldn't implement without them, but it provides another abstraction which I've found invaluable and have missed when using or creating JSON data sources.
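
As a toy example (the document here is invented), metadata such as units rides along as attributes without disturbing the content itself:

import xml.etree.ElementTree as ET

# 'sensor' and 'units' are attributes - metadata alongside the data, with
# no extra wrapper objects of the kind a JSON encoding would need.
doc = ET.fromstring(
    '<readings sensor="ambient">'
    '<temperature units="celsius">21.5</temperature>'
    '<humidity units="percent">40</humidity>'
    '</readings>'
)
readings = [(node.tag, node.text, node.get('units')) for node in doc]
# -> [('temperature', '21.5', 'celsius'), ('humidity', '40', 'percent')]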

What does make me slightly(!) sad though is the practical demise of XHTML and any priority that browsers might give to processing XML. There is now a many-to-one mapping of markup to DOM, and pre-HTML5 (and still in practice for the foreseeable future, considering browser idiosyncrasies and bugs) a many-to-many mapping. It wouldn't surprise me if XSLT transform support eventually disappeared from browsers.

Maybe there's a bit of elitism here - if you can't code well-formed markup and some decent XSLT (preferably with lots of convoluted functional programming thrown in) - then frankly 'get off my lawn!'. I love the new features in HTML(5), but part of me wishes that there was an implied background 'X' unquestionably preceding that, for all things. The success of the web is that it broke out of that mould. But in doing so it has compromised the formalisms which machines demand and require. Is the dream of the machine-readable semantic web getting further away - even as cool and accessible (and standards-compliant - at last) web content finally looks like it might possibly start to achieve its goal? Is it too much (and too late) to dream of 'data' (rather than ramblings like this one) being available in the same form for both the human viewer and the computer automaton?

I'm prepared to be realistic and accept where we've come to. It's not all bad, and the pace of technological change has never been faster. It's an exciting time to wield electronic information, and we've got the tools to move forward from inaccessible files stored on closed, disconnected systems. So where I used to say 'XML over HTTP', my new mantra shall now be 'HTTP or it doesn't exist'. At least for a while.