Friday, February 1, 2013

Large gzipped files, long lines, extracting columns etc

I am looking at directories with large numbers of gzipped files. Each file is about 20 MB to 30 MB compressed and has about 200,000 to 400,000 lines. Lines can be about 1 KB long and may contain 300 to 350 fields.

I am doing some exploratory data analysis with these data. Trial and error is a great method to get a feel for your data, but when each "trial" step takes a good portion of the day, you sometimes forget why you are doing what you are doing in the first place. Besides, other things come up, ideas become orphaned, and I get distracted.

Clearly, I'd make better progress if I could speed up each trial in this trial-and-error process.

For the time being, assume that I am stuck with puny hardware and that the data cannot be uploaded to the "cloud." What can I do to make things take a reasonable amount of time?

The underlying task implemented in the Perl script is simple: it reads lines from a gzipped log file, splits each into fields, selects some, and generates an output file consisting of the selected lines and columns.

First things first: PerlIO::gzip is much faster than IO::Uncompress::Gunzip. (The README for PerlIO::gzip makes it clear you should not trust your data to it, but I figured not much could go wrong just reading a bunch of gzipped ASCII files.)

The speed difference came as a surprise to me. The first version of the script, using IO::Uncompress::Gunzip, was taking about 60 seconds per file. Moving to PerlIO::gzip took that down to about 20 seconds per file. (For the record, I am using a perlbrew-compiled perl 5.16.2 on an almost three-year-old MacBook Pro with a 2.4 GHz Core Duo CPU and 8 GB of memory, running OS X 10.8.2.)

Well, that's a great improvement, but we are talking about many hundreds of files, and that's still too long for freewheeling data mining.

Besides, look at gunzip:

$ ls -l test-file-2.gz 
-rw-r--r--  1 user  user    19M Jan 31 10:37 test-file-2.gz
$ time gunzip test-file-2.gz 

real 0m2.831s
user 0m0.963s
sys 0m0.146s

Clearly, there is some room for improvement. In fact, take a look at what this Perl script can do:

$ cat nop.pl
#!/usr/bin/env perl

use autodie;
use strict;
use warnings;

use PerlIO::gzip;

my $filename = 'test-file-3.gz';

open my $in, '<:gzip', $filename;

while (my $line = <$in>) {
    my $x = length $line;
}

close $in;

__END__

And, it takes less than a second:

$ time ./nop.pl 

real 0m0.881s
user 0m0.840s
sys 0m0.026s

So, why were my scripts taking 20 seconds or so to go through each file?

Let's change the body of that loop to my @fields = split /;/, $line; and time the code again:

$ time ./nop.pl 

real 0m18.083s
user 0m17.943s
sys 0m0.062s

Whoa!!!

The penalty from just split seems to be at the root of the performance problem, even before any analysis is taken into account.
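One mitigation worth knowing about, assuming only low-numbered columns are needed, is split's third LIMIT argument, which makes split stop early instead of building hundreds of scalars for every line. A toy illustration (made-up data):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = join ';', 1 .. 300;   # a fake 300-field record

my @all   = split /;/, $line;        # builds 300 scalars per line
my @first = split /;/, $line, 35;    # 34 fields plus the unsplit rest

print scalar @all, ' vs ', scalar @first, "\n";   # 300 vs 35
```

The last element of the limited list holds the untouched remainder of the line, so this only helps when the wanted columns sit near the front.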

My first reaction was to try again with perl 5.8.9, 5.10.1, 5.12.4, and 5.14.2, all built with MacPorts to see if my brewed perl was the problem. None was faster. In fact, 5.14.2 took almost 30 seconds for a single file.

What the hey?!

At this point, spawning a couple of extra programs seems perfectly reasonable:

#!/usr/bin/env perl

use autodie; 
use strict;
use warnings;

my $filename = 'test-file-3.gz';

open my $in, '-|', "zcat $filename | cut -d ';' -f 10-15,21,33";

while (my $line = <$in>) {
    my @fields = split /;/, $line;
}

close $in;

__END__

Performance is much improved:

$ time ./nop.pl 

real 0m2.799s
user 0m3.839s
sys 0m0.102s

We are still talking about a single pass through all the files in a directory taking close to an hour, but that is much better than a third to half of a day!

Part of the improvement comes from the fact that the pipeline takes advantage of multiple cores. Plus, cut is very effective at extracting only what split needs.
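One detail worth remembering with the pipeline approach: the Perl side no longer sees the original column numbers, because cut renumbers the surviving fields. A tiny illustration with made-up field values:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Simulate a line that has already been through `cut -d ';' -f 10-15,21,33`:
# only the selected columns survive, in their original order.
my $line = 'f10;f11;f12;f13;f14;f15;f21;f33';

my @fields = split /;/, $line;
print $fields[0], "\n";   # f10 -- original column 10, now the first field
print $fields[6], "\n";   # f21 -- original column 21, now the seventh
```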

A naive attempt at using the example in perlipc resulted in the sub-process dedicated to splitting lines pegging one core at 100% while the gzip reader barely reached 7% utilization on the other.

At this point, I had to ask myself if I want any further improvement. … Well … yes, of course. But, at what cost?

There are basically two directions to go in. One is to use another core for splitting, which means forking a second child in the parent, figuring out how to switch between the children, and so on, or building a threaded perl on this machine. Also, do I want to maintain the original line order in the output? Nah, that's a hard problem, but the rest isn't straightforward either.

The other alternative is to write a dedicated splitter.

Here again, there are two alternatives. One is to build a regex pattern for the columns I want to extract. The other is to write a C routine dedicated to splitting on semicolons.

For the regex route, something along the lines of:

my @fields = (10 .. 15, 21, 33, 99);   # 0-based field indices
my $sep = ';';

my $matcher = '';
my $start   = 0;

for my $i ( @fields ) {
    my $d = $i - $start;

    # Skip $d complete fields (content plus separator), then capture one field.
    my $segment = ($d > 0 ? sprintf("(?:[^$sep]*$sep){%d}", $d) : '')
                . "([^$sep]*)";

    $matcher = length($matcher) ? join($sep, $matcher, $segment) : $segment;

    $start = $i + 1;
}

my $re = qr/\A$matcher/;

actually doesn't have horrible performance. For the same specification, it is only about half a second or so slower than the zcat-cut pipeline. Both implementations seem to get slower when higher column numbers are requested.
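Applying such a pattern in list context returns the captured columns directly. A hand-written instance of the same idea, capturing 0-based fields 2 and 4 from made-up data:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Skip two fields, capture one; skip one more, capture another.
my $re = qr/\A(?:[^;]*;){2}([^;]*);(?:[^;]*;){1}([^;]*)/;

my $line = 'a;b;c;d;e;f';
my @cols = $line =~ $re;
print "@cols\n";   # c e
```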

In fact, if I do the read in a forked child, and column extraction in the parent, it takes the same amount of time as the zcat-cut pipeline.
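That arrangement can be sketched with the "safe pipe open" from perlipc: the child writes lines to its STDOUT, and the parent reads and extracts columns. A minimal sketch, with a toy producer standing in for the gzip reader:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# open '-|' with no command forks; the child's STDOUT feeds the handle.
my $pid = open my $reader, '-|';
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: in the real script this would read the gzipped file.
    print "one;two;three\n" for 1 .. 3;
    exit 0;
}

# Parent: split and select columns.
while (my $line = <$reader>) {
    chomp $line;
    my @f = split /;/, $line;
    print $f[1], "\n";
}
close $reader;
```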

Right now, the fastest I can get this code is still too slow. For the fields (10 .. 15, 21, 33, 99), I can process about 50,000 lines per second. I would like to bring that to at least 100,000 lines per second.

Any suggestions? I probably overlooked something rather obvious ;-)

4 comments:

  1. First thought that crossed my mind was to convert fields to lines with a first pass of sed to s/;/\n/g and double the last \n of each line.

    This should let you use <> to read field by field, using Perl's internal line I/O, which is *very* fast. The empty line would mark the end of the record.

    Not sure if it covers all your requirements, but it might be worth a shot.

    Bye,

  2. I'm curious if Text::CSV_XS, being C code, would be faster than a split.

    Also, I would be tempted to try using AnyEvent::Util::fork_call in an event loop, processing the output from zcat. And using Text::CSV_XS in that subprocess.

    Finally, I'd also consider tearing all the data out that I want, stuffing it into a DB (whether SQLite or something heftier like DB2 depends on the amount of data, I suppose), and working with it from there. Tear it apart just once, doesn't matter if it takes all day, and then work with the data from the torn-apart repository.

  3. Hrm . . . I can think of a couple of suggestions, but it's hard to know if any of them would be effective without knowing a little more about the problem and the data.

    Do you always need to read and process every row? If not, then I'd strongly consider either trying to pre-parse the data to break it into the chunks I might need, or I'd dump the whole thing into a database so I could query against just the rows I'm interested in. Reducing the amount of data that you're processing up front will result in a big win. If you do need to always process all rows, then any solution for pre-parsing or indexing is going to be limited and difficult.

    You mention 300-350 columns. Does that mean that the column count differs between rows/files? Would it be possible to normalize that so that every row had the same number of columns, even if they were empty? If you were guaranteed to have the same number of columns, I might try breaking it out on a column basis, possibly instead of a row basis. For example, if there are 350 possible columns, have 350 files, one for each column, and insert a null value or an empty line for those "rows" that don't have a value in that column.

    At that point, your split problem disappears, as the files are pre-split. You also no longer have to read in the full 300-350 columns of data for each row to get at a dozen columns worth of data; you just read the files for the columns you want. This is similar to what you'd do with a DB, but would be faster (less overhead) if you plan on processing every row each time.

    The only other obvious thing I can think of, specific to this situation, would be to try to maximize the use of your 2 CPU cores and run two operations in parallel (probably map-reduce style, read all your columns with 2-3 processes, outputting the results to a file or parent process). How easy this is would depend heavily on whether your rows needed to remain in order. If they don't, you could use a number of tools, from something as simple as GNU parallel to something like one of the task/job queue Perl modules. If you do need them to remain ordered, then it becomes a bit more complicated, as you have to track the jobs separately and combine them in order.

    It's not an exact match to your situation, but a similar question came up on PerlMonks recently; it might be worth reading through the suggestions there, too: http://www.perlmonks.org/?node_id=1014517

  4. I have the same curiosity as Tanktalus
