XML parsing with Perl

December 8th, 2010 robin Posted in Ironman, Perl

XML is a ubiquitous format for data transfer, configuration, and much more. Processing it can be easy, but it can also be a pain. As always with Perl, there is more than one way to do it, so I’ve created a framework to test and compare some of the available modules.

The modules I considered were:

XML::Simple
XML::Smart
XML::Parser (using Tree style)
XML::Twig

My tests take a given XML file (a list of films as generated by Mediathek), parse out the individual film entries, and print them to a CSV file. I don’t actually need this, but it’s typical of what you might do when working with XML, and it allows good comparisons with both small and huge files.

Considerations:

The main difference between the clever modules (XML::Simple and XML::Smart) and the more grass-roots ones (XML::Parser and XML::Twig) is how they work through the XML. The clever ones read in the whole document, analyse it, and build a convenient hash/array representation of the elements. That gives you an easy format to work with afterwards, but it also means loading the whole XML in one go. This is not a problem for small files, but it quickly becomes dangerous because the process will use 10-20 times the file size in memory: a 60MB file can easily consume 1GB!

XML::Twig, by contrast, works through the file element by element, so its memory usage is very controllable: it is possible to parse hundreds of megabytes without the Perl process growing beyond 12MB! Note that XML::Parser in Tree style, as tested here, still builds the whole tree in memory (see the memory figures further down).

On the other hand, the clever modules are much easier to use, and you don’t need to know anything about your XML to process it. That makes them much more attractive for small files such as configuration files.
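To make the element-by-element approach concrete, here is a minimal sketch of XML::Twig turning film entries into CSV rows. The element names (films, film, title, channel) and the csv_field helper are assumptions for illustration; the real Mediathek XML and the code in the repository look different.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; the real Mediathek element names differ.
my $xml = <<'XML';
<films>
  <film><title>Tatort</title><channel>ARD</channel></film>
  <film><title>Heute, Journal</title><channel>ZDF</channel></film>
</films>
XML

# Quote a CSV field only when it contains a comma, quote, or newline.
sub csv_field {
    my ($value) = @_;
    if ( $value =~ /[",\n]/ ) {
        $value =~ s/"/""/g;
        $value = qq{"$value"};
    }
    return $value;
}

# Write to an in-memory filehandle here; a real run would open a file.
my $csv = '';
open my $out, '>', \$csv or die "Cannot open in-memory file: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each completely parsed <film> element.
        film => sub {
            my ( $t, $film ) = @_;
            print {$out} join( ',',
                map { csv_field( $film->first_child_text($_) ) } qw(title channel) ),
                "\n";
            $t->purge;    # free everything parsed so far
        },
    },
);
$twig->parse($xml);    # use $twig->parsefile($path) for a file on disk
close $out;

print $csv;
```

Each row is written as soon as its element is complete, and purge then discards it, which is why memory use stays flat however large the input is.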

For the details of the implementation, please have a look at the code examples:

git clone git@github.com:robin13/rcl-ironman.git

My tests showed the following results, though your mileage may vary: a lot depends on the depth and complexity of the XML you are dealing with, and on how much its size varies (big on Mondays, small on Thursdays?).

By Speed (seconds to process a 66MB file)

XML::Parser   26.5
XML::Simple   76.0
XML::Twig    132.8
XML::Smart   394.9
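Timings like these can be collected with a small harness around Time::HiRes. The following is only a sketch of the idea, not the framework from the repository: time_candidates is a hypothetical helper, and the two workloads are placeholders standing in for real parser runs.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Run each candidate once and return { name => elapsed seconds }.
sub time_candidates {
    my (%candidates) = @_;
    my %elapsed;
    for my $name ( sort keys %candidates ) {
        my $start = Time::HiRes::time();
        $candidates{$name}->();
        $elapsed{$name} = Time::HiRes::time() - $start;
    }
    return \%elapsed;
}

# Placeholder workloads standing in for real parser runs (hypothetical).
my $results = time_candidates(
    'fast-parser' => sub { my $n = 0; $n += $_ for 1 .. 1_000; },
    'slow-parser' => sub { my $n = 0; $n += $_ for 1 .. 2_000_000; },
);

# Print fastest first, in the same layout as the table above.
for my $name ( sort { $results->{$a} <=> $results->{$b} } keys %$results ) {
    printf "%-12s %8.3f\n", $name, $results->{$name};
}
```

For a fair comparison each candidate should parse the same file, ideally in a fresh process so that memory left over from one run does not skew the next.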

By Usability (subjective ease of implementation)

XML::Simple     Easy
XML::Twig       Usable
XML::Parser     Complex
XML::Smart      Not so smart...
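For a feel of what working with XML::Parser involves, here is a minimal sketch that walks the nested arrays returned by the Tree style. The sample XML and the helpers children and text_of are assumptions made for this illustration, not code from the tests.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample data; the real Mediathek element names differ.
my $xml = '<films>'
    . '<film><title>Tatort</title><channel>ARD</channel></film>'
    . '<film><title>Heute Journal</title><channel>ZDF</channel></film>'
    . '</films>';

# Tree style returns [ $root_tag, $content ], where $content is
# [ \%attributes, $tag1, $content1, $tag2, $content2, ... ]
# and a tag of 0 marks a text node whose content is the string itself.
my $tree = XML::Parser->new( Style => 'Tree' )->parse($xml);

# Split a content array into [ tag, content ] pairs, skipping attributes.
sub children {
    my ($content) = @_;
    my @pairs;
    for ( my $i = 1; $i < @$content; $i += 2 ) {
        push @pairs, [ $content->[$i], $content->[ $i + 1 ] ];
    }
    return @pairs;
}

# Concatenate the direct text nodes of an element.
sub text_of {
    my ($content) = @_;
    return join '', map { $_->[1] } grep { $_->[0] eq '0' } children($content);
}

my @films;
for my $child ( children( $tree->[1] ) ) {
    my ( $tag, $content ) = @$child;
    next unless $tag eq 'film';
    my %film = map { ( $_->[0] => text_of( $_->[1] ) ) }
        grep { $_->[0] ne '0' } children($content);
    push @films, \%film;
}

print "$_->{title} ($_->{channel})\n" for @films;
```

In pretty-printed XML the whitespace between elements shows up as extra text nodes with tag 0, which is why the loop filters by tag name instead of relying on positions.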

By Memory consumption (system memory in kB used for a 66MB file)

XML::Twig        972
XML::Parser   506532 (using Tree style)
XML::Simple   628268
XML::Smart   1336604

To summarise:

If you want high performance: XML::Parser
If you want relatively easy, memory efficient parsing of huge files: XML::Twig
If you want easy-to-implement for small files: XML::Simple
If you want to have a bad deal: XML::Smart
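As an illustration of the small-file case, here is a minimal sketch that reads a configuration snippet with XML::Simple. The config layout (cache_dir, server) is made up for this example.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical configuration file content.
my $xml = <<'XML';
<config>
  <cache_dir>/tmp/cache</cache_dir>
  <server name="alpha" port="8080"/>
  <server name="beta" port="8081"/>
</config>
XML

# ForceArray keeps <server> an array even if only one is present;
# KeyAttr => [] stops XML::Simple folding the list into a hash keyed by name.
my $config = XMLin( $xml, ForceArray => ['server'], KeyAttr => [] );

print "cache_dir: $config->{cache_dir}\n";
for my $server ( @{ $config->{server} } ) {
    print "$server->{name} listens on port $server->{port}\n";
}
```

Setting ForceArray and KeyAttr explicitly is a design choice worth making even in small scripts: with the defaults, the shape of the resulting structure can change depending on how many elements the file happens to contain.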

What I left out:

Probably a lot of usable modules…  If you know of something important, please comment!
Input/Output: XML::Simple (XMLout) and XML::Twig (print, flush) can write XML too; XML::Parser only parses.  That might be important for you (e.g. config files)!
My code is proof of concept, not highly optimised: feel free to improve, or even add other modules to the tests.  Git makes that easy! 🙂


MediathekP – Watching German TV with Perl

November 29th, 2010 robin Posted in Internet, Perl

This week I’m releasing an application for downloading German public broadcasting programmes.  It is essentially a Perl clone of Mediathek, and most of the work (scraping the sites for the media URLs) is still done by that project, which publishes the results hourly as XML.

I’ve built MediathekP a little differently from the original.  I first parse the XML (usually >20MB) with XML::Twig and load it into an SQLite database.  That may initially take somewhat longer than simply slurping everything in and displaying it, as its Java sibling does, but even a PC with very little free memory can manage it, and once the data is in the database, queries are also much faster and simpler.  The module Video::Flvstreamer, which I published last week, uses flvstreamer (an RTMP streaming client) to then save the videos locally.
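The XML-to-SQLite step can be sketched roughly as follows. The table schema, element names, and sample data are assumptions for illustration and do not match MediathekP's actual code; the sketch needs the DBI and DBD::SQLite modules alongside XML::Twig.

```perl
use strict;
use warnings;
use DBI;
use XML::Twig;

# Hypothetical sample data and schema; MediathekP's real ones differ.
my $xml = <<'XML';
<films>
  <film><title>Tatort</title><channel>ARD</channel></film>
  <film><title>Heute Journal</title><channel>ZDF</channel></film>
  <film><title>Tagesschau</title><channel>ARD</channel></film>
</films>
XML

# An in-memory database keeps the sketch self-contained; use a file path
# after "dbname=" to persist the data between runs.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );
$dbh->do('CREATE TABLE films (title TEXT, channel TEXT)');
my $insert = $dbh->prepare('INSERT INTO films (title, channel) VALUES (?, ?)');

my $twig = XML::Twig->new(
    twig_handlers => {
        # Store each <film> as soon as it has been parsed.
        film => sub {
            my ( $t, $film ) = @_;
            $insert->execute( map { $film->first_child_text($_) } qw(title channel) );
            $t->purge;    # drop the element once it is stored
        },
    },
);

# One transaction for the whole import avoids a commit per row.
$dbh->begin_work;
$twig->parse($xml);
$dbh->commit;

my ($ard_count) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM films WHERE channel = ?', undef, 'ARD' );
print "ARD films: $ard_count\n";
```

Once the rows are in the database, listing or searching films is a plain SQL query and no longer touches the XML at all.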

A quick guide:

git clone git://github.com/robin13/mediathekp.git
cd mediathekp
./mediathek.pl --cache_dir /path/to/cache --action refresh_media
./mediathek.pl --cache_dir /path/to/cache --action list
./mediathek.pl --cache_dir /path/to/cache --target_dir /path/to/target --action download --id 123

New Perl module: Video::Flvstreamer

November 22nd, 2010 robin Posted in Internet

I’ve just published a new module: Video::Flvstreamer.

It’s a pretty simple wrapper around the command-line application flvstreamer, a handy tool that records video streams and saves them as files you can play with your favourite video player.  Many websites have great streaming video content, but if you don’t like watching video in your browser, or want to watch it later, you can use flvstreamer to capture it.

Video::Flvstreamer can be used to integrate flvstreamer into Perl applications (here getting “Mit Offenen Karten” from Arte):

use strict;
use warnings;
use Video::Flvstreamer;

# Stream URL and the local file to save it to
my $url = "rtmp://artestras.fcod.llnwd.net/a3903/o35/MP4:geo/videothek/EUR_DE_FR/arteprod/A7_SGT_ENC_04_043742-007-A_PG_HQ_DE?h=97599a8731270956b946df58bb6b6b95";
my $target = "/tmp/Mit_Offenen_Karten.flv";
my $flv = Video::Flvstreamer->new();
$flv->get( $url, $target );

I’m currently working on a Perl clone of Mediathek – a Java application for downloading German public TV shows: Video::Flvstreamer is an integral part of that application.

Video::Flvstreamer has one neat improvement over the command-line tool: you can specify how many times it should try to resume a download.  Connections for downloading a stream are often dropped… Now that’s less of a bother! 🙂
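The retry idea itself is simple enough to sketch in a few lines. This is not the module's implementation or its API: with_retries is a hypothetical helper, and the anonymous sub merely simulates a download that is disconnected twice before it completes.

```perl
use strict;
use warnings;

# Call $attempt up to $max_tries times until it returns true.
# Returns the number of tries used, or 0 if every try failed.
sub with_retries {
    my ( $attempt, $max_tries ) = @_;
    for my $try ( 1 .. $max_tries ) {
        return $try if $attempt->($try);
    }
    return 0;
}

# Stand-in for a download that is disconnected twice before it completes.
my $calls = 0;
my $tries = with_retries( sub { return ++$calls >= 3 }, 5 );

print $tries ? "Finished after $tries tries\n" : "Gave up\n";
```

In a real downloader the attempt would resume from the partial file rather than start over, so each retry only has to fetch what is still missing.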