feedparser: Because RSS is Hairy
Lindsey Smith (@turbodog)
http://www.feedparser.org/

feedparser: because RSS is hairy
  • RSS formats bundle HTML
  • User input via HTML is hairy
  • There are several syndication formats and versions (RSS, Atom, etc.)
  • Diagram labels: RSS, HTML, Micro-format

feedparser: because RSS is hairy
Download and parse just about any feed type, including:
  • Various flavors of Atom and RSS
  • Format extensions (iTunes)
  • Micro-formats (GeoRSS, hCard)
Ensures that you can treat all feeds the same way, regardless of format or version

feedparser: because RSS is hairy
Digests whatever crap you throw at it:
  • Sanitizes HTML
  • Normalizes dates
  • Resolves relative links
  • Detects feed type, version and encoding
  • Bozo detection: flags non-well-formed feeds without blowing up

feedparser: because RSS is hairy
  • Parses a URL, local file or string data
  • Handles the 304 Not Modified HTTP return code
  • HTTP basic auth
  • Custom request headers
  • Custom handlers
  • Captures response headers

feedparser: the good ol' days
  • Created circa 2002 by Mark Pilgrim of Dive Into Python fame
  • Powers feedvalidator.org
  • v4.1 released in 2007
  • Open source
  • Well-documented
  • 3000 unit tests
  • Available in popular Linux distros

feedparser: the lean years
  • Development slows to a trickle
  • No official releases
  • Atom & RSS continue to evolve
  • iTunes enclosures
  • v4.1 released in 2007
  • Still available in popular Linux distros

feedparser 5.0: a new hope
  • Small group of developers start working on feedparser
  • v5.0 released January 2011
  • Supports Python 3
  • Micro-formats
  • CSS & HTML5 sanitization
  • Bug fixes, bug fixes, bug fixes

>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['feed']['title']  # feed data is a dictionary
u'Sample Feed'
>>> d.feed.title  # get values attr-style or dict-style
u'Sample Feed'
>>> d.channel.title  # use RSS or Atom terminology anywhere
u'Sample Feed'
>>> d.feed.link  # resolves relative links
u'http://example.org/'
>>> d.feed.subtitle  # parses escaped HTML
u'For documentation <em>only</em>'

>>> len(d['entries'])  # entries are a list
1
>>> d['entries'][0]['title']  # each entry is a dictionary
u'First entry title'
>>> d.entries[0].title  # attr-style works here too
u'First entry title'
>>> d['items'][0].title  # RSS terminology works here too
u'First entry title'
>>> e = d.entries[0]
>>> e.link  # easy access to alternate link
u'http://example.org/entry/3'
>>> e.links[1].rel  # full access to all Atom links
u'related'
>>> e.links[0].href  # resolves relative links here too
u'http://example.org/entry/3'

>>> e.updated_parsed  # parses all date formats
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
>>> e.content[0].value  # sanitizes dangerous HTML
u'<div>Watch out for <em>nasty tricks</em></div>'
>>> d.version  # reports feed type and version
u'atom10'
>>> d.encoding  # auto-detects character encoding
u'utf-8'
>>> d.headers.get('Content-type')  # full access to all HTTP headers
u'application/xml'
>>> d.bozo  # well-formed?
0

feedparser: caveats
  • Fairly slow and CPU intensive
  • FriendFeed rolled their own parser and fell back on feedparser
  • Team is looking at ways to speed it up

feedparser: the project details
  • Home page: http://www.feedparser.org
  • Discussion: http://code.google.com/p/feedparser
