I give up on libxml for the time being, and think instead of Chris Petrilli’s comment that ruby (and python) performance is “not quite in the league of Smalltalk (or Lisp, likely), which have extremely mature VMs with on-the-fly compilation and optimization”. Is Smalltalk then much faster than python or ruby, or comparable with C, for the task of parsing moderately large XML files?
No. Time to load and parse my iTunes library file, an 11mb Apple plist, on a 1 GHz G4 Powerbook with VisualWorks Non-Commercial 7.3.1: about three minutes.
That didn't seem right - I use the XML code in VW extensively, so I'm pretty familiar with it. I grabbed my iTunes file (only 2.7 MB) and parsed that - took 5.5 seconds. Well, the two caveats are, that's a smaller file, and my hardware isn't his hardware. With that in mind, I went ahead and created a large XML file. I grabbed the default feed file for BottomFeeder, and saved it as an XML feed list instead of as a binary dump - like this:
file := Tools.XMLConfigFileSupport.XMLConfigFile filename: 'g:\vw74\image\feeds.xml'. file saveObject: RSSFeedManager default subscribedFeedsFolder. file saveConfiguration
That just dumps the 80 sample feeds into a (pretty verbose) XML format - I ended up with a 13 MB file. That seemed large enough, so I tried the parse on that:
content := 'feeds.xml' asFilename contentsOfEntireFile. parser := XMLParser new. parser validate: false. Time millisecondsToRun: [parser parse: content readStream]
That last line times the execution - it ran in 17.9 seconds. Not a couple of seconds, but not 3 minutes, either. There was some GC going on during that, so I'm sure that things could be improved by simply configuring VW with a larger bite of old space up front - in dealing with large amounts of data, a fair bit of time is going to be chewed up either in allocating more memory, or GC'ng if we hit the current limits (as per the memory policy in place).
For this kind of parse to take 3 minutes, either the hardware would have to be very slow, or memory limits would have to be set badly for dealing with larger files. I'm not entirely sure what was going on.
Update: I ran the same code on my Mac Mini - it has a 1.3 Ghz G4 processor, and a paltry 256MB of RAM. The 2.7 MB file parsed in 12.8 seconds, the 13 MB file in 44.7 seconds. Not speedy, but not the 3 minutes reported by Alan Little either - and the Mini is no high end Mac.