badly formed
This post from BitWorking is everything you need to know about XML in one easy package. Here's what I mean:
When I tried to load it into an XML parser it failed to load. As it turns out the file was riddled with character set encoding problems, in particular quote marks. After much hand tweaking I finally have it in shape where a real XML parser can open it up. Now I can get on with the job of importing the data into pyblosxom. It isn't supposed to be this hard.
The real value of XML is interop and the currency of that interop is syntax as expressed in the term "well-formed".
I love this theory from the XML crowd. The theory being, "If only all XML data were well formed, our parsers would work". Well, deal with reality - not all XML will be well formed. Ever. There will always be crap out there. Some of it will be crap that people want to read. So, do you do what the advocates say - auto-reject the bad stuff and keep your parser "pure"? Or do what a sensible person does - have your parser deal with errors gracefully (logging them in some fashion so that you have a shot at notifying the producer) and moving on? The Atom advocates are still in the navel gazing phase of imagining a perfect world of all well formed XML. The rest of us who live in the real world have accepted the reality of bad content and moved on.
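To make the "log it and move on" half of that concrete, here is a minimal sketch in Python using the standard library's strict parser. The feed names, the sample documents and the `parse_or_log` helper are all made up for illustration. It only shows logging and skipping a bad document so the rest of the batch still gets read; actually recovering content from a malformed feed would need a more forgiving parser, which is not shown here.

```python
import logging
import xml.etree.ElementTree as ET

log = logging.getLogger("feedreader")  # hypothetical logger name


def parse_or_log(source_name, text):
    """Return the parsed root element, or None after logging the failure."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        # Record enough to notify the producer later, then carry on.
        log.warning("%s is not well-formed: %s", source_name, err)
        return None


# Two made-up feeds: one well-formed, one with a missing </title>.
feeds = {
    "good": "<rss><channel><title>ok</title></channel></rss>",
    "bad": "<rss><channel><title>oops</channel></rss>",
}

parsed = {name: parse_or_log(name, text) for name, text in feeds.items()}
readable = [name for name, root in parsed.items() if root is not None]
```

The point of the design is that one bad feed costs a log line, not the whole run.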


Comments
Re: badly formed
[ Michael Lucas-Smith] June 20, 2004 17:51:38.337
Comment on badly formed by Michael Lucas-Smith
When it's company dealing with company, you can expect well formed XML. When it's some public offering made by joe blogs.. well.. good luck :)
Smalltalk Tidbits, Industry Rants: badly formed
[42] June 20, 2004 19:14:09.269
Trackback from 42
Between Smalltalk Tidbits, Industry Rants: badly formed and The real value of XML is wellformedness | 2004-06-20 | BitWorking we seem to have a fairly fundamental disagreement. Yet to me it is not so simple (seems I am in a...
well-formed is good
[Juergen] June 20, 2004 21:26:32.819
actually, i like the well-formedness.
it gives me either the (more-or-less) expected result, or the parser barfs.
admittedly, i never even tried to handle third-party xml off the wild wild web. what i use it for is to store stuff in databases or to post-process xhtml fragments coming in from said databases (or from somewhere else). i find having all data well formed at all times makes things much easier. it gives you an error when the data enters the system, not months after the fact when you have to make sense of some free-form ad-hoc encoding scheme. also, i am trying very hard not to mention html or sgml here ... but the ease of parsing and lack of ambiguity (is this a word? not my native language) makes xml very attractive to me.
and while i am on a rant, let me mention that i hate xsl(t). dsssssssl (i just can't stop spelling it) was better than that. half an hour's effort in perl (and i hate perl) could give you something better than xsl. but then i guess the concept is flawed. they try to give you the flexibility of an actual programming language without giving you an actual programming language. "but it's xml parseable". so what?
everybody think of clowns and glowfish now. go on, there is nothing to see here
bryan
[bry@itnisk.com] June 21, 2004 8:38:13.469
the following is my poorly formed xml, please resolve it:
ebh9hjà BN M´32QERAG
Can't do it?
[bry@itnisk.com] June 21, 2004 8:42:58.794
Obviously there is a level in which no malformed xml can be reformed, the problem is that in reforming you are also running a risk of reforming wrong (the chance of a document being well-formed but still well-formed incorrectly strikes me as being less than the chance of a document being malformed and when you reform it you reform it incorrectly), depending on your application, and on how mission-critical data is, reforming can be highly problematic. I don't see a problem with a feed processor going ahead and logging errors and proceeding, but the way you have phrased your argument seems to suppose that what works reasonably well for your domain is the way that should be used for all domains using xml, which is at the moment causing me to have a very bad reaction to this bad coffee I'm drinking.
Malformedness is the problem; not the solution
[Elliotte Rusty Harold] June 21, 2004 10:41:44.935
If you actually read the post, it becomes apparent that the problem arose due to software from a developer with a track record of encouraging malformed XML and an inability to deal properly with character encoding issues. If this developer had used a real XML parser in the first place, rather than hacking together some poor code based on string manipulations, the problem would have been caught early and fixed. Checking for and requiring well-formedness is a critical part of making sure broken systems fail as early and as fast as possible, where it's easiest and cheapest to fix them.
If RSS parsers, editors, viewers, and other tools all required well-formedness, there would be a lot less malformed RSS in the world, and what little did escape into the wild would quickly be detected and fixed. As it is the problem feeds on itself in a vicious cycle. The more malformed XML is accepted, the more malformed XML is generated, and the more malformed XML must be accepted. This only increases costs and trouble for everyone to nobody's benefit. This is exactly the sort of problem draconian well-formedness checking was designed to prevent; and it is the sort of problem only draconian well-formedness checking can fix.
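As a rough illustration of failing early on the producing side, a generator could run its own output through a real XML parser before anything is published. This is only a sketch in Python: the `publish` function and the `write` callback are hypothetical, and the sample bytes are made up. The second document declares UTF-8 but contains a Latin-1 byte, which is the kind of encoding slip described in the original post.

```python
import xml.etree.ElementTree as ET


def publish(feed_bytes, write):
    """Hand feed_bytes to write() only if a strict parser accepts them."""
    # Raises ET.ParseError when the bytes are not well-formed XML,
    # including bytes that do not match the declared encoding.
    ET.fromstring(feed_bytes)
    write(feed_bytes)


published = []

# Correctly encoded: the e-acute is written as UTF-8.
good = '<?xml version="1.0" encoding="utf-8"?><title>caf\u00e9</title>'.encode("utf-8")
publish(good, published.append)

# Declared as UTF-8 but carrying a raw Latin-1 byte (0xE9).
bad = b'<?xml version="1.0" encoding="utf-8"?><title>caf\xe9</title>'
try:
    publish(bad, published.append)
    rejected = False
except ET.ParseError:
    rejected = True
```

Checked this way, the broken document never leaves the system that produced it.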
Re: badly formed
[ James Robertson] June 21, 2004 11:04:42.589
Comment on badly formed by James Robertson
Elliotte,
You say:
In the feed arena, we are talking non-critical, consumer oriented data. Which means that tools will naturally tend towards compensating for bad data. In the B2B space, I rather expect that things would tilt the other way - reject any malformation, because it's not appropriate to "guess". In the consumer space, with content meant for reading, it's entirely appropriate to guess and do whatever you can to present information to the end user.
They've standardized failure
[Jon H] June 22, 2004 23:58:23.905
I guess it can be considered an improvement.
XML, like other formats, can fail due to garbled or badly entered data. But with other formats, each format will fail in its own special way.
At least with XML, the errors are more predictable, and error handling can be more easily standardized, with some of it done by libraries you didn't have to write.
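A small sketch of what that predictability can look like, assuming Python's standard library parser (which is built on expat): a failure comes back as one exception type with a numeric code and a position, and the library supplies the message text. The `describe_failure` helper and the sample input are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import ErrorString


def describe_failure(text):
    """Return None if text is well-formed, else a dict describing the error."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        line, column = err.position
        return {
            "code": err.code,                  # expat error number
            "message": ErrorString(err.code),  # text supplied by the library
            "line": line,
            "column": column,
        }


report = describe_failure("<a><b></a>")  # <b> is never closed
```

The same handler works for any malformed document, whatever went wrong inside it.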