Large gzipped files, long lines, extracting columns etc

I am looking at directories with large numbers of gzipped files. Each file is about 20-30 MB compressed and has about 200,000 to 400,000 lines. Lines can be about 1 KB long and may contain 300 to 350 fields.

I am doing some exploratory data analysis with these data. Trial and error is a great method to get a feel for your data, but when each “trial” step takes a good portion of the day, you sometimes forget why you are doing what you are doing in the first place. Besides, other things come up, ideas become orphaned, I get distracted etc etc.

Clearly, I’d make better progress if I can speed up each trial in this trial and error process.

For the time being, assume that I am stuck with puny hardware. The data cannot be uploaded to the “cloud” etc. What can I do to make things take a reasonable amount of time?

The underlying task implemented in the Perl script is simple: It reads lines from a gzipped log file, splits each line into fields, selects some of them, and generates an output file consisting of the selected lines and columns.

First things first: PerlIO::gzip is much faster than IO::Uncompress::Gunzip (although the README for PerlIO::gzip makes it clear you should not trust your data to it, I figured not much could go wrong just reading a bunch of gzipped ASCII files).
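
For reference, reading a gzipped file with each module looks roughly like this (a minimal sketch; the file name is the one used in the scripts below):

use strict;
use warnings;
use autodie;

use IO::Uncompress::Gunzip qw( $GunzipError );
use PerlIO::gzip;

my $filename = 'test-file-3.gz';

# IO::Uncompress::Gunzip: object interface
{
    my $z = IO::Uncompress::Gunzip->new($filename)
        or die "gunzip failed: $GunzipError\n";
    while (defined(my $line = $z->getline)) {
        # ... process $line ...
    }
    $z->close;
}

# PerlIO::gzip: decompression as an I/O layer on a plain filehandle
{
    open my $in, '<:gzip', $filename;
    while (my $line = <$in>) {
        # ... process $line ...
    }
    close $in;
}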

The speed difference came as a surprise to me. The first version of the script, using IO::Uncompress::Gunzip, was taking about 60 seconds per file. Moving to PerlIO::gzip took that down to about 20 seconds per file. (For the record, I am using a perlbrew-compiled perl 5.16.2 on an almost three-year-old MacBook Pro with a 2.4 GHz Core Duo CPU and 8 GB of memory, running OS X 10.8.2.)

Well, that’s a great improvement, but we are talking about many hundreds of files, so 20 seconds per file is still too long for freewheeling data mining.

Besides, look at gunzip:

$ ls -l test-file-2.gz
-rw-r--r--  1 user  user    19M Jan 31 10:37 test-file-2.gz
$ time gunzip test-file-2.gz

real 0m2.831s
user 0m0.963s
sys 0m0.146s

Clearly, there is some room for improvement. In fact, take a look at what this Perl script can do:

$ cat nop.pl
#!/usr/bin/env perl

use autodie;
use strict;
use warnings;

use PerlIO::gzip;

my $filename = 'test-file-3.gz';

open my $in, '<:gzip', $filename;

while (my $line = <$in>) {
    my $x = length $line;
}

close $in;

__END__

And, it takes less than a second:

$ time ./nop.pl
real 0m0.881s
user 0m0.840s
sys 0m0.026s

So, why were my scripts taking 20 seconds or so to go through each file?

Let’s change the body of that loop so that it splits each line into fields on semicolons:
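
while (my $line = <$in>) {
    my @fields = split /;/, $line;
}

Now, timing the code again: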

$ time ./nop.pl

real 0m18.083s
user 0m17.943s
sys 0m0.062s

Whoa!!!

The penalty from just split seems to be at the root of the performance problem, even before any actual analysis is done.

My first reaction was to try again with perl 5.8.9, 5.10.1, 5.12.4, and 5.14.2, all built with MacPorts to see if my brewed perl was the problem. None was faster. In fact, 5.14.2 took almost 30 seconds for a single file.

What the hey?!

At this point, spawning a couple of extra programs seems perfectly reasonable:

#!/usr/bin/env perl

use autodie;
use strict;
use warnings;

my $filename = 'test-file-3.gz';

open my $in, '-|', "zcat $filename | cut -d \\; -f 10-15,21,33";

while (my $line = <$in>) {
    my @fields = split /;/, $line;
}

close $in;

__END__

Performance is much improved:

$ time ./nop.pl

real 0m2.799s
user 0m3.839s
sys 0m0.102s

We are still talking about a single pass through all the files in a directory taking close to an hour, but that is much better than a third to half of a day!

Part of the improvement comes from the fact that the pipeline takes advantage of multiple cores. Plus, cut is very effective in extracting only what split needs.
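
Note that cut numbers fields starting at 1 and passes the selected fields through with the delimiter intact, so the downstream split only ever sees the handful of columns we asked for. On a toy line:

$ echo 'f1;f2;f3;f4;f5;f6' | cut -d';' -f 2,4-5
f2;f4;f5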

A naive attempt at using the example in perlipc resulted in the sub-process dedicated to splitting lines pegging one core at 100%, whereas the gzip reader barely reached 7% utilization on the other core.

At this point, I had to ask myself if I want any further improvement. … Well … yes, of course. But, at what cost?

There are basically two directions we can go in. One is to try to use another core for splitting, which means either forking one more time in the parent and figuring out how to switch between the children etc etc, or building a threaded perl on this machine. Also, would I want to maintain the original order of the lines in the output? Nah, that’s a hard problem, but the rest isn’t straightforward either.

The other alternative is to write a dedicated splitter.

There again, there are two alternatives. One is to build a regex pattern for the columns I want to extract. The other is to write some C routine dedicated to splitting on semicolons.

For the regex route, something along the lines of:

my @fields = (10 .. 15, 21, 33, 99);
my $sep = ';';

my $matcher;
my $start = 0;

for my $i ( @fields ) {
    my $d = $i - $start;

    # skip $d unwanted fields (each followed by a separator),
    # then capture the field we want
    my $segment = join(
        '',
        $d > 0 ? sprintf("(?:[^$sep]*$sep){%d}", $d) : (),
        "([^$sep]*)",
    );

    if (defined $matcher) {
        $matcher = join $sep, $matcher, $segment;
    }
    else {
        $matcher = $segment;
    }

    $start = $i + 1;
}

my $re = qr/\A$matcher/;

actually doesn’t have horrible performance. For the same specification, it is only about half a second or so slower than the zcat-cut pipeline. Both implementations seem to get slower when higher column numbers are requested.
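
For what it’s worth, using the pattern is then a single match per line, something along these lines:

while (my $line = <$in>) {
    chomp $line;                   # keep the newline out of the last capture
    my @fields = $line =~ $re
        or next;                   # skip lines with fewer fields than requested
    # ... output the selected fields ...
}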

In fact, if I do the read in a forked child, and column extraction in the parent, it takes the same amount of time as the zcat-cut pipeline.
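
Such an arrangement can be set up with the “safe pipe open” from perlipc: the child streams decompressed lines to the parent, and the parent does the matching. A sketch (the pattern is shortened to a single column here for brevity; in reality it would be the $re built above):

#!/usr/bin/env perl

use strict;
use warnings;

use PerlIO::gzip;

my $filename = 'test-file-3.gz';

# stand-in pattern capturing just field 10; use the full $re from above instead
my $re = qr/\A(?:[^;]*;){10}([^;]*)/;

# fork, with the child's STDOUT connected to $reader in the parent
my $pid = open my $reader, '-|';
defined $pid or die "Cannot fork: $!";

unless ($pid) {
    # child: decompress and stream plain lines to the parent
    open my $in, '<:gzip', $filename
        or die "Cannot open '$filename': $!";
    print while <$in>;
    close $in;
    exit 0;
}

# parent: extract the wanted columns from each incoming line
while (my $line = <$reader>) {
    chomp $line;
    my @fields = $line =~ $re
        or next;
    # ... output the selected fields ...
}

close $reader;
waitpid $pid, 0;

__END__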

Right now, the fastest I can get this code is still too slow. For the fields (10 .. 15, 21, 33, 99), I can process about 50,000 lines per second. I would like to bring that to at least 100,000 lines per second.

Any suggestions? I probably overlooked something rather obvious ;-)