Friday, August 23, 2013

Don't rush to love the ugly duckling

In yesterday's installment, I felt great about an improvement of at least 3000% over an alternative approach.

In fact, the script I discussed in that post showed only about a 40% improvement over the first one I had tried, which did not involve buffering output, etc.

It turns out I had missed the obvious lesson of my earlier trials and tribulations: while splitting long lines is slow, it is still faster than iterating through space-separated fields using regular expression matches.
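To make the comparison concrete, here is a minimal sketch (with a made-up ten-field line) of the two styles in question; only timing on real data, not this snippet, settles which is faster:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# A made-up line of ten space separated fields, for illustration only.
my $line = join(' ', 'a' .. 'j') . "\n";

# Style 1: split the whole line at once.
my @via_split = split ' ', $line;

# Style 2: walk the line with a global regex match,
# pulling out one field per iteration.
my @via_match;
while ($line =~ /(\S+)/g) {
    push @via_match, $1;
}

print scalar(@via_split), " fields via split, ",
      scalar(@via_match), " via m//g\n";
```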

This point was driven home by Prakash Kailasa's comment, where he showed a script that ran in almost half the time mine took.

His code is paraphrased below:

my @keep;
while (<>) {
    my @data = split;

    unless (@keep) {
        # Build a boolean mask once, from the first line:
        # keep fields 2, 4, 5 (1-based), then two of every three.
        @keep = (0, 1, 0, 1, 1);
        for (my $i = 5; $i < @data; $i += 3) {
            push @keep, 1, 1, 0;
        }
    }

    my $i = 0;
    print join(' ', grep $keep[$i++], @data), "\n";
}
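The interesting trick is the grep with a side-effecting index: @keep is a boolean mask applied positionally to @data. A minimal illustration with made-up data:

```perl
use strict;
use warnings;

my @keep = (0, 1, 0, 1, 1);    # drop fields 1 and 3; keep 2, 4, 5
my @data = qw(a b c d e);

# $i advances once per element grep examines, so each element of
# @data is tested against the corresponding slot of the mask.
my $i = 0;
my @kept = grep $keep[$i++], @data;

print "@kept\n";
```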

Timings:

$ time ./zz.pl input.data > /dev/null 
real 0m21.861s
user 0m21.310s
sys 0m0.280s

It is about twice as fast as my version that jumps through all sorts of hoops. Thank you for pointing this out, Prakash.

Is this, however, an acceptable level of performance? After all, I am testing on a file of just 225 MB with only hundreds of thousands of columns of data, whereas the OP has 20 GB - 50 GB files with millions of columns of data.

That's where another halving of the time might matter. For that, one can use Inline::C (and bring back the ugliness ;-)

#!/usr/bin/env perl

use strict;
use warnings;

use Inline C => <<'END_C'

/* This code 'works' only in a limited set of circumstances!
   Don't expect anything good if you feed it anything other
   than plain ASCII */

#include <ctype.h>

SV *
extract_fields(char *line, AV *wanted_fields)
{
    int ch;
    IV current_field = 0;
    IV wanted_field = -1;

    unsigned char *cursor = (unsigned char *) line;
    unsigned char *field_begin = (unsigned char *) line;
    unsigned char *save_field_begin;

    STRLEN field_len = 0;
    IV i_wanted = 0;
    IV n_wanted = av_len(wanted_fields);

    AV *ret = newAV();

    while (i_wanted <= n_wanted) {
        SV **p_wanted = av_fetch(wanted_fields, i_wanted, 0);
        if (!(*p_wanted)) {
            croak("av_fetch returned NULL pointer");
        }
        wanted_field = SvIV(*p_wanted);

        while ((ch = *(cursor++))) {

            if (!isspace(ch)) {
                continue;
            }

            field_len = cursor - field_begin - 1;
            save_field_begin = field_begin;
            field_begin = cursor;

            current_field += 1;

            if (current_field != wanted_field) {
                continue;
            }

            av_push(ret, newSVpvn((const char *) save_field_begin, field_len));
            break;
        }
        i_wanted += 1;
    }

    return newRV_noinc((SV *) ret);
}

END_C
;

my @keep;
while (my $line = <>) {
    unless (@keep) {
        @keep = (2, 4, 5);
        my @data = split ' ', $line;
        push @keep, grep +(($_ - 5) % 3), 6 .. scalar(@data);
    }
    my $fields = extract_fields($line, \@keep);
    print join(' ', @$fields), "\n";
}

This one takes (of course, after the first run where there is compilation overhead):

$ time ./zz.pl input.data > /dev/null 
real 0m11.539s
user 0m11.083s
sys 0m0.300s
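As a sanity check on the keep-list arithmetic, which is the 1-based equivalent of the boolean mask in the earlier script, here is a quick sketch with a hypothetical 14-column line:

```perl
use strict;
use warnings;

# Hypothetical line of 14 columns: fields are numbered 1 .. 14.
my $ncols = 14;

my @keep = (2, 4, 5);
push @keep, grep +(($_ - 5) % 3), 6 .. $ncols;

# Fields 8, 11, 14 (every third field after 5) are dropped.
print "@keep\n";
```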

Of course, at this point, one could realize that the only reason we are not using cut is the arithmetic involved in figuring out which fields to keep. So, why not use Perl to do the arithmetic and let cut do the extraction:

#!/usr/bin/env perl

use strict;
use warnings;

my ($infile) = @ARGV;
die "need filename\n" unless defined $infile;

open my $in, '<', $infile
    or die "Cannot open '$infile': $!";

my $line = <$in>;

close $in
    or die "Cannot close '$infile': $!";

my @keep = (2, 4, 5);
my @data = split ' ', $line;
push @keep, grep +(($_ - 5) % 3), 6 .. scalar(@data);

my @cmd = (
    cut =>
    -d  => ' ',
    -f  => join(',', @keep),
    $infile
);

print "@cmd\n";
my $ret = system @cmd;
$ret and die "cut call failed\n";

Output:

$ time ./cutmaker.pl input.data > /dev/null
Can't exec "cut": Argument list too long at ./cutmaker.pl line 30.
cut call failed

It turns out there is another reason to use Perl ;-)
