Feedparser

feedparser: Because RSS is Hairy
http://www.feedparser.org/
Lindsey Smith, @turbodog

feedparser: because RSS is hairy
- RSS formats bundle HTML
- User input via HTML is hairy
- There are several syndication formats and versions (RSS, Atom, etc.)
[Diagram labels: RSS, HTML, Micro-format]

feedparser: because RSS is hairy
- Download and parse just about any feed type, including:
  - Various flavors of Atom and RSS
  - Format extensions (iTunes)
  - Micro-formats (GeoRSS, hCard)
- Ensures that you can treat all feeds the same way, regardless of format or version

feedparser: because RSS is hairy
- Digests whatever crap you throw at it
- Sanitizes HTML
- Date normalization
- Resolving relative links
- Feed type, version and encoding detection
- Bozo detection of non-well-formed feeds without blowing up
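To give a feel for three of the chores on this slide, here is a minimal standard-library sketch of link resolution, date normalization and a bozo-style flag. It is not feedparser's implementation: the helper names (`resolve_link`, `normalize_rfc822_date`, `parse_with_bozo_flag`) are made up for illustration, only RFC 822 dates are handled, and, unlike feedparser, this version gives up on a broken feed instead of salvaging what it can.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def resolve_link(base_url, href):
    """Resolve a possibly relative link against the feed's base URL."""
    return urljoin(base_url, href)


def normalize_rfc822_date(text):
    """Convert an RSS-style RFC 822 date string to a UTC time.struct_time."""
    return parsedate_to_datetime(text).utctimetuple()


def parse_with_bozo_flag(xml_text):
    """Parse XML and report ill-formed input via a flag instead of raising.

    Returns (root, bozo, exception): bozo is 0 for well-formed input and
    1 otherwise, loosely mirroring the idea behind feedparser's bozo bit.
    """
    try:
        return ET.fromstring(xml_text), 0, None
    except ET.ParseError as exc:
        return None, 1, exc
```

The point of the flag-plus-exception return is that one broken feed in a batch does not have to abort the whole run; the caller decides what to do with it.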

feedparser: because RSS is hairy
- Parse URL, local file or string data
- 304 Not Modified HTTP return code
- HTTP basic auth
- Custom request headers
- Custom handlers
- Captures response headers
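The 304 support pairs naturally with a small polling cache. The sketch below assumes feedparser 5.x or later, where `feedparser.parse` accepts `etag`, `modified` and `request_headers` keyword arguments and the result exposes `status`, `etag` and `modified` when the server sends them. The function name `poll_feed`, the shape of the cache dictionary and the `X-Example-Client` header are hypothetical; the injectable `parse` argument is there only so the logic can be exercised without a network.

```python
def poll_feed(url, cache, parse=None):
    """Fetch url only if it changed since the last poll recorded in cache.

    cache maps url -> {"etag": ..., "modified": ...} (a hypothetical layout).
    Returns the parsed result, or None when the server answers 304.
    """
    if parse is None:
        import feedparser  # third-party; imported lazily so the sketch loads without it
        parse = feedparser.parse

    previous = cache.get(url, {})
    result = parse(
        url,
        etag=previous.get("etag"),          # sent as If-None-Match
        modified=previous.get("modified"),  # sent as If-Modified-Since
        request_headers={"X-Example-Client": "feed-poller-sketch"},
    )

    if result.get("status") == 304:
        return None  # not modified: nothing to download or re-parse

    # Remember the validators so the next poll can be conditional.
    cache[url] = {
        "etag": result.get("etag"),
        "modified": result.get("modified"),
    }
    return result
```

Typical use would be `poll_feed(url, cache)` inside a scheduler loop, persisting `cache` between runs so unchanged feeds cost one cheap request each.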

feedparser: the good ol’ days
- Created circa 2002 by Mark Pilgrim of Dive Into Python fame
- Powers feedvalidator.org
- v4.1 released in 2007
- Open source
- Well-documented
- 3000 unit tests
- Available in popular Linux distros

feedparser: the lean years
- Development slows to a trickle
- No official releases
- Atom & RSS continue to evolve
- iTunes enclosures
- v4.1 released in 2007
- Still available in popular Linux distros

feedparser 5.0: a new hope
- A small group of developers starts working on feedparser
- v5.0 released January 2011
- Supports Python 3
- Micro-formats
- CSS & HTML5 sanitization
- Bug fixes, bug fixes, bug fixes

>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['feed']['title']  # feed data is a dictionary
u'Sample Feed'
>>> d.feed.title  # get values attr-style or dict-style
u'Sample Feed'
>>> d.channel.title  # use RSS or Atom terminology anywhere
u'Sample Feed'
>>> d.feed.link  # resolves relative links
u'http://example.org/'
>>> d.feed.subtitle  # parses escaped HTML
u'For documentation <em>only</em>'

>>> len(d['entries'])  # entries are a list
1
>>> d['entries'][0]['title']  # each entry is a dictionary
u'First entry title'
>>> d.entries[0].title  # attr-style works here too
u'First entry title'
>>> d['items'][0].title  # RSS terminology works here too
u'First entry title'
>>> e = d.entries[0]
>>> e.link  # easy access to alternate link
u'http://example.org/entry/3'
>>> e.links[1].rel  # full access to all Atom links
u'related'
>>> e.links[0].href  # resolves relative links here too
u'http://example.org/entry/3'

>>> e.updated_parsed  # parses all date formats
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
>>> e.content[0].value  # sanitizes dangerous HTML
u'<div>Watch out for <em>nasty tricks</em></div>'
>>> d.version  # reports feed type and version
u'atom10'
>>> d.encoding  # auto-detects character encoding
u'utf-8'
>>> d.headers.get('Content-type')  # full access to all HTTP headers
u'application/xml'
>>> d.bozo  # well-formed?
0

feedparser: caveats
- Fairly slow and CPU intensive
- FriendFeed rolled their own and fell back on feedparser
- Team is looking at ways to speed it up

feedparser: the project details
- Home page: http://www.feedparser.org
- Discussion: http://code.google.com/p/feedparser
