testing

Smalltalk XML Parsing

December 12, 2005 16:17:06.410

I ran across this page in my aggregator before lunch, and flagged it for followup. That time has come - it must be testing day on this blog. Here's what jumped out at me:

I give up on libxml for the time being, and think instead of Chris Petrilli’s comment that ruby (and python) performance is “not quite in the league of Smalltalk (or Lisp, likely), which have extremely mature VMs with on-the-fly compilation and optimization”. Is Smalltalk then much faster than python or ruby, or comparable with C, for the task of parsing moderately large XML files?

No. Time to load and parse my iTunes library file, an 11mb Apple plist, on a 1 GHz G4 Powerbook with VisualWorks Non-Commercial 7.3.1: about three minutes.

That didn't seem right - I use the XML code in VW extensively, so I'm pretty familiar with it. I grabbed my iTunes file (only 2.7 MB) and parsed that - took 5.5 seconds. Well, the two caveats are, that's a smaller file, and my hardware isn't his hardware. With that in mind, I went ahead and created a large XML file. I grabbed the default feed file for BottomFeeder, and saved it as an XML feed list instead of as a binary dump - like this:


file := Tools.XMLConfigFileSupport.XMLConfigFile 
                     filename: 'g:\vw74\image\feeds.xml'.
file saveObject: RSSFeedManager default subscribedFeedsFolder.
file saveConfiguration

That just dumps the 80 sample feeds into a (pretty verbose) XML format - I ended up with a 13 MB file. That seemed large enough, so I tried the parse on that:


content := 'feeds.xml' asFilename contentsOfEntireFile.
parser := XMLParser new.
parser validate: false.
Time millisecondsToRun: [parser parse: content readStream]

That last line times the execution - it ran in 17.9 seconds. Not a couple of seconds, but not 3 minutes, either. There was some GC going on during that, so I'm sure that things could be improved by simply configuring VW with a larger bite of old space up front - in dealing with large amounts of data, a fair bit of time is going to be chewed up either in allocating more memory, or GC'ng if we hit the current limits (as per the memory policy in place).
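One way to take some of that GC noise out of the measurement is to force a global collection just before the clock starts. This is only a sketch: it assumes the same 'feeds.xml' file and workspace setup as the snippet above, and it does not pre-grow old space, so allocation cost during the parse is still included.

```smalltalk
| content parser elapsed |
"Load the file first so that disk I/O is not part of the measurement."
content := 'feeds.xml' asFilename contentsOfEntireFile.
parser := XMLParser new.
parser validate: false.
"Reclaim garbage left over from earlier work before timing starts."
ObjectMemory globalGarbageCollect.
elapsed := Time millisecondsToRun: [parser parse: content readStream].
Transcript show: 'parse: ', elapsed printString, ' ms'; cr
```

Running it a few times in a row gives a feel for how much of the spread comes from memory management rather than from the parser itself.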

For this kind of parse to take 3 minutes, either the hardware would have to be very slow, or memory limits would have to be set badly for dealing with larger files. I'm not entirely sure what was going on.

Update: I ran the same code on my Mac Mini - it has a 1.3 GHz G4 processor, and a paltry 256 MB of RAM. The 2.7 MB file parsed in 12.8 seconds, the 13 MB file in 44.7 seconds. Not speedy, but not the 3 minutes reported by Alan Little either - and the Mini is no high-end Mac.

Comments

Faster than Java

[Jim Thompson] December 12, 2005 16:32:18.146

We tested VW XML parsing against a couple of Java frameworks a few years ago and VW was often almost an order of magnitude faster. In fact just about everything was an order of magnitude faster and smaller. James, see, I can say some nice things about your product.

But what code is Alan Little using?

[Isaac Gouy] December 12, 2005 18:49:13.134

but not the 3 minutes reported by Alan Little either

But possibly not the same code that Alan Little is using - maybe it would be nice to ask him what he was doing?

Jeebus

[James Robertson] December 12, 2005 19:33:23.588

Isaac, stop being a hopeless pedant. He said that it took 3 minutes to parse an 11 MB document. VW ships with a parser, and I seriously doubt that he downloaded VWNC and implemented one himself.

Here's what I did

[Alan Little] December 13, 2005 0:57:15.392

I'm still open to the possibility that I might have been doing something obviously wrong, since it's about 15 years since I wrote Smalltalk code for real. What I did was just:


XML.XMLParser processDocumentInFilename: 'filename'
    beforeScanDo: [:p | p validate: false].

as described in Fred Gagne's tutorial on the VW website.

My Powerbook is presumably a bit slower than your Mini and has 768 MB of RAM - a *large* chunk of which was used in this case, but I didn't hear it obviously doing a lot of disk swapping so I don't think I was memory-bound. Apple's plist is a decidedly odd "xml" format too - I wonder if that might behave in a way that's unlike other xml benchmarks?

I can test against your file

[James Robertson] December 13, 2005 7:31:36.403

Alan, if you are willing, just send me the file (compressed). My email address is on the front page of the blog

Another point in space

[petrilli] December 13, 2005 10:58:12.365

Using James' code (i.e. with the content already in memory) on a default NC image with the HTTP parcel loaded, I get a runtime of 27.046s to parse a 13 MB iTunes XML library. If I use what Alan mentions (with the original content deleted and a global GC first), I get 29.7s. Slower, for some reason, but not substantially. This is on a Dell X410 notebook with 1 GB RAM and a 1.7 GHz Pentium M, using 7.3.1NC.

an ounce of curiosity

[Isaac Gouy] December 13, 2005 11:13:53.255

Isaac, stop being a hopeless pedant

an ounce of curiosity is better than a pound of assumption

15 seconds to parse a 6.5 GB iTunes library

[Troy Brumley] December 13, 2005 12:40:33.824

I copied my library from my Mac over to the PC (D810) and it took 15.401 seconds. The PC was not idle. I note that Chris Petrilli has commented on the Mac VM before, and he feels it performs poorly in a few areas. Jim, your Mac Mini may have a higher bus speed than Alan's Powerbook. I will run a test on my iBook and report results.

More timings, this time on an iBook

[Troy Brumley] December 13, 2005 13:04:42.635

I went ahead and installed the last Winter 05 VW on my iBook and ran against that same 6.5 gig iTunes Library file. In three runs I got between 26.4 and 29.7 seconds, instead of the 15 seconds it took to process on the underclocked PC (Dell D810, Pentium M, 800mhz). This is a 1.2 Ghz G4 iBook with 1.25 gig of RAM. Bus speed 133mhz. Alan, if you are still reading, what are the specs of your Powerbook wrt bus speed and chip caches?

file on the way

[Alan Little] December 13, 2005 13:35:06.328

I'm afraid I don't know the bus & cache specs. It's a 2003 version 1 GHz 12 inch Powerbook.

Bizarre

[petrilli] December 13, 2005 16:41:08.654

I just retested on my 667Mhz G4 Powerbook (Titanium, 2nd generation), with 1Gb RAM, and hit 167 seconds on both the X and native VM. This is very odd.

Specs

[Troy Brumley] December 13, 2005 16:52:24.317

I got my information re bus speed and cache sizes from everymac.com, but the system profiler reports bus speed, iirc.

System            L1 Cache  L2 Cache  Bus Speed
Alan's Powerbook  64 KB     512 KB    133 Mhz
My iBook          64 KB     512 KB    133 Mhz
Jim's Mac Mini    64 KB     512 KB    167 Mhz

I recall that someone from Heeg felt that graphics would be most impacted by the L2 cache (the VW display problems on Mac OS X). Bus speed is probably the biggest key here, so I highlighted Jim's Mac Mini. I note that he has said that he finds VW on the Mac "usable" but I do not yet. I'm waiting for the fixed VM, which last I heard won't be part of the Winter 2005 delivery.

X11 VM is faster

[Carl Gundel] December 13, 2005 20:05:51.693

Troy, I finally broke down and started to use the X11 VM on my PowerBook G4 1.5GHz. I resisted this for a while but eventually realized I needed more speed. It did the trick for me. Maybe it'll help you too.

file posted

[Alan Little] December 14, 2005 2:40:08.787

I tried to send my file to James but it didn't make it through Cincom's mail filters. So, for James and anybody else who might be curious, it is at http://www.alanlittle.org/private/alanlittle.xml.gz. I won't leave it there for long.

No X11 for me

[Troy Brumley] December 14, 2005 10:00:29.415

I've used that VM in the past, but the feel is so un-Mac-like that I threw it out. I didn't even install it this last go round. Thanks for testing, Carl, but I'm waiting for the real deal. I just grabbed Alan's XML file and will post timings of it in a bit.

Different numbers

[Troy Brumley] December 14, 2005 10:43:43.619

I took Alan's file, and ran it four times each of two different ways. The first bit of code that Jim posted looks like this:

Jim's Script:


content := '/users/tbrumley/desktop/alanlittle.xml' asFilename contentsOfEntireFile.
parser := XMLParser new.
parser validate: false.
Time millisecondsToRun: [parser parse: content readStream]

And then a modified version of what Alan reported he ran with looks like this:

Alan's Script:


Time millisecondsToRun: [XML.XMLParser processDocumentInFilename: '/users/tbrumley/desktop/alanlittle.xml'
    beforeScanDo: [:p | p validate: false]].

Alan's timings will be I/O sensitive. This does take longer, but it comes in at around a minute on my iBook. Here are the timings:

Run #  Jim's Script (s)  Alan's Script (s)
1      73.439            72.517
2      44.489            54.997
3      46.351            64.469
4      47.638            58.826
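To put rough numbers on the spread in those runs, a small workspace snippet can report the fastest, slowest and mean time per column. This is only an illustrative sketch: the figures are copied from the table above and treated as seconds, and the names jim, alan and report are made up for the example.

```smalltalk
| jim alan report |
"Timings copied from the table above, assumed to be in seconds."
jim := #(73.439 44.489 46.351 47.638).
alan := #(72.517 54.997 64.469 58.826).
"Print the fastest, slowest and mean run for one column."
report := [:name :runs | | fastest slowest mean |
    fastest := runs inject: runs first into: [:a :b | a min: b].
    slowest := runs inject: runs first into: [:a :b | a max: b].
    mean := (runs inject: 0 into: [:sum :each | sum + each]) / runs size.
    Transcript
        show: name, ': min ', fastest printString,
            ' max ', slowest printString,
            ' mean ', mean printString; cr].
report value: 'Jim''s script' value: jim.
report value: 'Alan''s script' value: alan
```

With only four runs per script, and a noticeably slower first run in both columns, the mean is a rough guide at best.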

There's a lot of variation in Alan's script, but I suspect if I wrapped the whole of Jim's script in a timer, the variation there might increase as well. Checking out the information at everymac.com, Alan and I have different disks/controllers, and it is possible that mine is faster.
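A sketch of what timing the whole of Jim's script might look like, with the file load and the parse measured separately so the I/O share is visible. It assumes the same file path as the scripts above; loadMs and parseMs are illustrative names.

```smalltalk
| path content parser loadMs parseMs |
path := '/users/tbrumley/desktop/alanlittle.xml'.
"Time the file read on its own; this is the I/O-sensitive part."
loadMs := Time millisecondsToRun: [content := path asFilename contentsOfEntireFile].
parser := XMLParser new.
parser validate: false.
"Time the parse of the in-memory contents."
parseMs := Time millisecondsToRun: [parser parse: content readStream].
Transcript
    show: 'load: ', loadMs printString, ' ms'; cr;
    show: 'parse: ', parseMs printString, ' ms'; cr;
    show: 'total: ', (loadMs + parseMs) printString, ' ms'; cr
```

The total is the figure that is comparable with Alan's script, since that one reads the file inside the timed block.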

Alan, could you do a run using Jim's script and report your timings?

Here's what I got

[James Robertson] December 14, 2005 10:55:21.729

My Windows Box: 20.6 seconds

My Mac Mini: 62 seconds

Like Alan's original test, those include file load time. There's a significant difference across hardware. Just for jollies, I tried it under Linux (Red Hat 7) on an ancient PII 400 Mhz box. That took 2 minutes and 20 seconds.

about a minute for me too now

[] December 14, 2005 16:38:42.463

My Powerbook was clearly having a Strange Day the first time I did this. Perhaps I hadn't rebooted for a few weeks or something.

Anyway, running Jim's code I'm now getting around 5-6 seconds to load, 52-55 seconds to parse.

Times for my code seem a lot more variable. 62 fastest, but also a couple of runs around 80-82.

I'm also getting a very favourable impression of how helpful, constructive and non-flaming the Smalltalk community is. Thanks guys.
