I posted on RSS yesterday - and Steve Waring linked to it here. Today, I started catching up on the news in my feedlist when I came across this post from Mark Pilgrim:
As I said in last month's article, RSS is an XML-based format for syndicating news and news-like sites. XML was chosen, among other reasons, to make it easier to parse with off-the-shelf XML tools. Unfortunately in the past few years, as RSS has gained popularity, the quality of RSS feeds has dropped. There are now dozens of versions of hundreds of tools producing RSS feeds. Many have bugs. Few build RSS feeds using XML libraries; most treat it as text, by piecing the feed together with string concatenation, maybe (or maybe not) applying a few manually coded escaping rules, and hoping for the best.
On average, at any given time, about 10% of all RSS feeds are not well-formed XML. Some errors are systemic, due to bugs in publishing software. It took Movable Type a year to properly escape ampersands and entities, and most users are still using old versions or new versions with old buggy templates. Other errors are transient, due to rough edges in authored content that the publishing tools are unable or unwilling to fix on the fly. As I write this, the Scripting News site's RSS has an illegal high-bit character, a curly apostrophe. Probably just a cut-and-paste error -- I've done the same thing myself many times -- but I don't know of any publishing tool that corrects it on the fly, and that one bad character is enough to trip up any XML parser.
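To make the failure mode concrete, here is a rough Python sketch of the two ways of producing a feed that Mark describes, plus a strict well-formedness check. The function names (`feed_by_concatenation`, `feed_by_library`, `is_well_formed`) are made up for illustration and don't come from his article or from any particular publishing tool; only the standard library's `xml.etree.ElementTree` is assumed.

```python
import xml.etree.ElementTree as ET


def feed_by_concatenation(title):
    # Hypothetical "treat it as text" generator: no escaping at all.
    return ('<rss version="2.0"><channel><title>' + title +
            '</title></channel></rss>')


def feed_by_library(title):
    # Hypothetical generator that lets the XML library do the escaping.
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    return ET.tostring(rss, encoding="unicode")


def is_well_formed(xml_data):
    # A strict parser either accepts the whole document or rejects it.
    try:
        ET.fromstring(xml_data)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    title = "Tom & Jerry"
    print(is_well_formed(feed_by_concatenation(title)))  # bare ampersand
    print(is_well_formed(feed_by_library(title)))        # ampersand escaped

    # One stray high-bit byte (a Windows-1252 curly apostrophe) in a feed
    # that claims to be UTF-8.
    bad_bytes = (b'<?xml version="1.0" encoding="utf-8"?>'
                 b'<rss><channel><title>Dave\x92s feed</title></channel></rss>')
    print(is_well_formed(bad_bytes))
```

The string-built feed only breaks when the content happens to contain a special character, which is probably why these bugs come and go the way Mark describes.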
I can certainly feel for this problem. With BottomFeeder, we have had to catch lots of complaints from the VW XML Parser. We generally let them go by and continue on - because if we just rejected feeds when they were invalid, there would be precious few feeds that we could support. Mark goes on a bit further down:
There is a social solution to this problem: register at Syndic8.com to be a "fixer", and volunteer your time contacting the authors of individual sites to get them to fix their feeds. There is also a technical solution to this problem: don't use an XML parser.
I know, I know, this is heresy. The point of XML is that content producers are supposed to put up with the pain of XML formatting rules so that content consumers can do cool things with off-the-shelf tools. Well, guess what? It's not happening. Judging by the sad state of affairs in the RSS world, content producers are either ignorant of the error of their ways, or too lazy to fix the errors, or too busy, or locked into inflexible tools whose vendors are too busy... Whatever the reasons, content consumers are rarely in a position to solve the problem. So we must work around it. We need a parse-at-all-costs RSS parser.
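For a feel of what "parse at all costs" might look like, here is a deliberately crude Python sketch that pulls item titles and links out of a feed with regular expressions instead of an XML parser. This is not Mark's parser and not how BottomFeeder does it; `parse_at_all_costs` and `_field` are names invented for this example, and a real tool would need to handle far more cases (namespaces, nested markup, other encodings).

```python
import html
import re

# Match each <item>...</item> block without caring whether the whole
# document is well-formed XML.
ITEM_RE = re.compile(r"<item\b[^>]*>(.*?)</item\s*>",
                     re.IGNORECASE | re.DOTALL)


def _field(block, tag):
    # Hypothetical helper: return the text of the first <tag> in an item.
    match = re.search(r"<%s\b[^>]*>(.*?)</%s\s*>" % (tag, tag),
                      block, re.IGNORECASE | re.DOTALL)
    if not match:
        return ""
    text = match.group(1).strip()
    cdata = re.fullmatch(r"<!\[CDATA\[(.*?)\]\]>", text, re.DOTALL)
    if cdata:
        return cdata.group(1).strip()
    # Decode whatever entities are present; stray ampersands are left alone.
    return html.unescape(text)


def parse_at_all_costs(raw):
    # Undecodable bytes become U+FFFD instead of aborting the parse.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    return [{"title": _field(block, "title"), "link": _field(block, "link")}
            for block in ITEM_RE.findall(raw)]


if __name__ == "__main__":
    broken = (b"<rss><channel><item><title>Dave\x92s AT&T post</title>"
              b"<link>http://example.com/?a=1&b=2</link></item>"
              b"</channel></rss>")
    for item in parse_at_all_costs(broken):
        print(item["title"], item["link"])
```

The trade-off is the one the quote hints at: this accepts the bad byte and the bare ampersands, but it will also accept things that are not RSS in any meaningful sense.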
It's an interesting article with some controversial ideas. I think Mark's right though - we are headed down the road to heck, where we will end up with tools that deal with just about anything that stumbles by calling itself RSS - not unlike the current HTML mess...