XML parsing with Perl

December 8th, 2010, by robin. Posted in Ironman, Perl.

XML is a ubiquitous format for data transfer, configuration, and much more. Processing it can be easy, but it can also be a pain. As always with Perl, there is more than one way to do it… So I’ve created a framework to test and compare some of the available modules.

The modules I considered were:

XML::Simple
XML::Smart
XML::Parser (using Tree style)
XML::Twig

My tests take a given XML file (a list of films as generated by Mediathek), parse out the individual film entries, and print them to a CSV file. I don’t actually need this, but it’s typical of what you might need to do when working with XML, and it allows some good comparisons with both small and huge files.
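To make the task concrete, here is a minimal sketch of the XML-to-CSV step using XML::Simple. It is not the code from my repository. The element names (films, film, title, channel, duration) are assumptions for illustration, as the real Mediathek export may be structured differently, and the CSV quoting is hand-rolled to keep the example short.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical field names; adjust to the elements in your own XML.
my @FIELDS = qw(title channel duration);

# Quote one CSV field: wrap it in double quotes and double embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Parse a whole document (file name or XML string) and return CSV lines.
# XML::Simple reads the complete document into memory before returning.
sub films_to_csv_lines {
    my ($xml) = @_;
    my $data = XMLin(
        $xml,
        ForceArray    => ['film'],   # always an array ref, even for one <film>
        KeyAttr       => [],         # do not fold lists into hashes
        SuppressEmpty => '',         # empty elements become '' instead of {}
    );
    return map {
        my $film = $_;
        join ',', map { csv_field( $film->{$_} ) } @FIELDS;
    } @{ $data->{film} || [] };
}

# Usage (file names are placeholders):
#   open my $out, '>', 'films.csv' or die $!;
#   print {$out} "$_\n" for films_to_csv_lines('films.xml');
```

For anything beyond a sketch, a dedicated CSV module would be a safer choice than the small quoting helper above.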

Considerations:

The main difference between the clever modules (XML::Simple and XML::Smart) and the more grass-roots ones (XML::Parser and XML::Twig) is how they work through the XML. To be clever, the clever ones suck in the whole XML, analyse it, and build a hash/array representation of the elements. That gives you an easy-to-use structure to work with afterwards, but it also means reading the whole document in one go. That is no problem for small files, but it quickly becomes dangerous, because the process will use 10-20 times the file size in memory: a 60MB file can easily consume 1GB!

XML::Twig, by contrast, works through the file element by element, so its memory usage is very controllable: it is possible to parse hundreds of megabytes without the perl process growing beyond 12MB. (XML::Parser in Tree style, as I used it here, also builds the whole tree in memory, which shows in the numbers below.)

On the other hand, the clever modules are much easier to use, and you don’t need to know anything about your XML to process it. That makes them much more attractive for small (e.g. configuration) files.
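The element-by-element approach can be sketched with XML::Twig roughly as follows. This is an illustration and not the code from my repository: the element names (film, title, channel, duration) are hypothetical, and the sub name is made up for the example. The handler fires once per completed film element, writes a row, and then purges what has been parsed so far, which is what keeps memory flat.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical field names; adjust to the elements in your own XML.
my @FIELDS = qw(title channel duration);

# Quote one CSV field: wrap it in double quotes and double embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Stream $source (an open filehandle, or a string of XML) and write one CSV
# row per <film> element to $out_fh. Returns the number of rows written.
sub stream_films_to_csv {
    my ( $source, $out_fh ) = @_;
    my $count = 0;
    my $twig  = XML::Twig->new(
        twig_handlers => {
            # Called each time a complete <film> element has been parsed.
            film => sub {
                my ( $t, $film ) = @_;
                my @row = map { csv_field( $film->first_child_text($_) ) } @FIELDS;
                print {$out_fh} join( ',', @row ), "\n";
                $count++;
                $t->purge;    # discard everything parsed so far
            },
        },
    );
    $twig->parse($source);
    return $count;
}

# Usage (file names are placeholders):
#   open my $in,  '<', 'films.xml' or die $!;
#   open my $out, '>', 'films.csv' or die $!;
#   stream_films_to_csv( $in, $out );
```

Passing an open filehandle rather than a string matters for big files: a string would already hold the whole document in memory before parsing starts.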

For the details of the implementation, please have a look at the code examples:

git clone git@github.com:robin13/rcl-ironman.git

My tests showed these relationships, though your mileage might vary – it depends a lot on the depth/complexity, and variance in size (big on Mondays, small on Thursdays?) of the XML you are dealing with.

By Speed (seconds to process a 66MB file)

XML::Parser   26.5
XML::Simple   76.0
XML::Twig    132.8
XML::Smart   394.9
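Wall-clock timings like these can be collected with the core Time::HiRes module. The following is a sketch of one way to do it, not necessarily how the framework in my repository does it. The sub names are made up for the example, and each parser is assumed to be wrapped in a code reference that takes no arguments.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run each named code ref once and return { name => elapsed seconds }.
sub time_parsers {
    my (%parsers) = @_;
    my %elapsed;
    for my $name ( sort keys %parsers ) {
        my $start = [gettimeofday];
        $parsers{$name}->();
        $elapsed{$name} = tv_interval($start);
    }
    return \%elapsed;
}

# Format the results fastest first, one "Module  seconds" line per parser.
sub format_report {
    my ($elapsed) = @_;
    return map { sprintf '%-12s %7.1f', $_, $elapsed->{$_} }
        sort { $elapsed->{$a} <=> $elapsed->{$b} } keys %$elapsed;
}

# Usage (parse_with_twig and parse_with_simple are placeholders):
#   my $elapsed = time_parsers(
#       'XML::Twig'   => sub { parse_with_twig('films.xml') },
#       'XML::Simple' => sub { parse_with_simple('films.xml') },
#   );
#   print "$_\n" for format_report($elapsed);
```

A single run per module is a rough measure; disk caching alone can shift the numbers between runs, so repeating the runs and comparing is worthwhile.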

By Usability (subjective ease of implementation)

XML::Simple     Easy
XML::Twig       Usable
XML::Parser     Complex
XML::Smart      Not so smart...

By Memory consumption (system memory in kB used for a 66MB file)

XML::Twig        972
XML::Parser   506532 (using Tree style)
XML::Simple   628268
XML::Smart   1336604
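If you want to take similar measurements yourself, one option on Linux is to read the peak resident set size of the process from /proc/self/status after parsing. This is a sketch under that assumption, and not necessarily the method behind the figures above. The sub name is made up for the example, and it returns undef on systems without /proc.

```perl
use strict;
use warnings;

# Return the peak resident set size of this process in kB, or undef if
# /proc/self/status is not available (i.e. not on Linux).
sub peak_rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while ( my $line = <$fh> ) {
        # VmHWM is the "high water mark" of resident memory.
        return $1 if $line =~ /^VmHWM:\s+(\d+)\s+kB/;
    }
    return;
}

# Usage (parse_with_twig is a placeholder):
#   parse_with_twig('films.xml');
#   my $kb = peak_rss_kb();
#   print defined $kb ? "peak RSS: $kb kB\n" : "peak RSS: not available\n";
```

Because the peak is tracked per process, each module should be measured in its own perl process; otherwise the hungriest parser sets the figure for everything that runs after it.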

To summarise:

If you want high performance: XML::Parser
If you want relatively easy, memory efficient parsing of huge files: XML::Twig
If you want easy-to-implement for small files: XML::Simple
If you want to have a bad deal: XML::Smart

What I left out:

Probably a lot of usable modules… If you know of something important, please comment!
Input/Output: I only tested parsing. XML::Simple (via XMLout) and XML::Twig can write XML too. That might be important for you (e.g. config files)!
My code is proof of concept, not highly optimised: feel free to improve, or even add other modules to the tests. Git makes that easy! 🙂


Andreas Says:
December 8th, 2010 at 01:27

Hi Robin. Good to know. Maybe worth noting: use an event parser for big files, not DOM.

Bernhard Schmalhofer Says:
December 10th, 2010 at 08:27

I’m missing the modules XML::LibXML and XML::Compile and XML::LibXML::Simple.

In general I would shy away from using modules based on expat.
The XML-parser libxml2 is much better maintained.

I haven’t yet used XML::LibXML::Simple. But it looks promising as it has the easy interface of XML::Simple and is based on libxml2.

robin Says:
December 10th, 2010 at 08:48

Thanks for the ideas! I’m looking forward to your git pull request with those modules integrated as well! 😛

Robin Clarke – Life and Tech
