On the Atom mailing list, there's been a lot of talk recently about what should/should not be done with malformed feeds. The answer probably differs based on context:
- In a B2B context, you probably want to reject malformed XML data. This isn't an appropriate place to make a "best guess" and move along.
- In a consumer context (i.e., the one most news aggregators live in), it's reasonable to flag the bad data (so that a user who cares can report it) and try to present it anyway.
The difference is context - if it's a business-level communication, then guessing isn't appropriate. If, on the other hand, I'm trying to find out what the latest baseball scores are, then I don't really care about the stray Unicode character that wandered into a feed.
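To make the two policies concrete, here is a minimal sketch using only Python's standard library. The function name `load_feed`, the `strict` flag, and the repair step (replace undecodable bytes, drop illegal control characters) are my own assumptions for illustration; this is not how any particular aggregator or the Universal Feed Parser does it.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow anywhere in a document.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def load_feed(data: bytes, strict: bool):
    """Return (root_element, warnings) for a feed document.

    strict=True  -> B2B style: any well-formedness error is raised.
    strict=False -> consumer style: flag the problem, repair, and retry.
    """
    try:
        return ET.fromstring(data), []
    except ET.ParseError as err:
        if strict:
            raise
        # Keep the error so a user who cares can report it.
        warnings = [f"not well-formed: {err}"]

    # Best-guess repair (hypothetical, and far from a general fix):
    # decode as UTF-8, replacing undecodable bytes, then strip the
    # control characters that XML forbids.
    text = data.decode("utf-8", errors="replace")
    text = _ILLEGAL_XML_CHARS.sub("", text)
    # This can still raise ParseError if the damage is structural.
    return ET.fromstring(text.encode("utf-8")), warnings


if __name__ == "__main__":
    bad = b"<rss><title>Scores \x01 today</title></rss>"
    root, warnings = load_feed(bad, strict=False)
    print(root.find("title").text, warnings)
```

Calling `load_feed(bad, strict=True)` on the same bytes raises `ET.ParseError` instead, which is the behavior you would want on the business side.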
The truly interesting piece is the stats that Mark Pilgrim dug up:
I analyzed 5096 RSS and Atom feeds chosen at random from Syndic8.com and parsed them with Universal Feed Parser 3.0.1 using the latest version of libxml2 as the underlying XML parser.
Actually, I analyzed more feeds than that, but I threw away feeds that
- didn't either return an HTTP status code 200 or redirect to a URL that returned 200, or
- didn't have a recognizable root-level element of some version of RSS or Atom
- 3929 feeds (77.10%) were well-formed.
- 961 feeds (18.86%) were not well-formed due to specifying "Content-Type: text/xml" but containing non-us-ascii characters.
- 206 feeds (4.04%) were not well-formed for other reasons.
Nearly a quarter of the feeds sampled (22.9%, and this likely holds across all feeds) have issues - and they are issues that a tighter spec is not going to solve. We've crossed the Rubicon on this one, at least in the consumer space....
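The largest failure category above can be checked mechanically. The sketch below assumes the RFC 3023 rule that `text/xml` served without a `charset` parameter is treated as us-ascii; the function names `declared_charset` and `fits_declared_charset` are hypothetical, and this is my reading of that category, not the code used in the survey.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Charset implied by an HTTP Content-Type header value.

    Returns None when the header does not settle the question and the
    document's own XML declaration would have to be consulted instead.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()
    # Assumption from RFC 3023: text/xml with no charset means us-ascii.
    if msg.get_content_type() == "text/xml":
        return "us-ascii"
    return None


def fits_declared_charset(body: bytes, content_type: str) -> bool:
    """True if the body decodes cleanly under the header's charset."""
    charset = declared_charset(content_type)
    if charset is None:
        return True  # the header alone gives no verdict
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


if __name__ == "__main__":
    body = "<rss><title>café</title></rss>".encode("utf-8")
    print(fits_declared_charset(body, "text/xml"))
    print(fits_declared_charset(body, "text/xml; charset=utf-8"))
```

A feed with a single accented character fails the first check and passes the second, which is why tightening the feed format's spec does little here: the problem sits in how the feed is served, not in the feed's markup.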