Thursday, August 22, 2013

Sometimes, ugly solutions work better

Another day, another Stack Overflow question. For reference, here is the issue:

I have some huge files (30-50 GB each), and they are constructed like this - millions of columns and thousands of rows:

A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10

I would like to delete column "A" and column "C", then every third of the number columns: the "3" column, the "6" column, then the "9" column, and so on until the end of the file. Space delimited.

I have some experience dealing with large files, so I thought I would offer a solution that took into account both the very long lines and the large number of rows the poster had to process. First, splitting long lines is (understandably) slow: lots of substrings have to be copied to lots of memory locations. So, my solution avoids split when there are more than a few hundred columns in each line. Second, producing output is also slow. My solution for that is to accumulate output and print it every thousand lines or so.

So, this is essentially what I recommended:

#!/usr/bin/env perl

use strict;
use warnings;

my $chunk = '';

while (my $line = <>) {
    # grab B, D, and E; leave the numbers unsplit in $rest
    my (undef, $x, undef, $y, $z, $rest) = split ' ', $line, 6;
    my $row = join ' ', $x, $y, $z;
    my $cell = 1;
    # walk the number columns one match at a time, skipping every third
    while ($rest =~ /(\S+)/g) {
        $row .= ' ' . $1 if $cell % 3;
        $cell += 1;
    }
    $chunk .= $row . "\n";
    unless ($. % 1_000) {
        print $chunk;
        $chunk = '';
    }
}

print $chunk;

The OP originally went with the other, more elegant answer (and changed his mind while I was composing this blog post):

my @remove;

while (<>) {
  my @data = split;

  unless (@remove) {
    @remove = (0, 2);    # columns A and C
    for (my $i = 7; $i < @data; $i += 3) {
      push @remove, $i;  # every third number column
    }
  }

  splice @data, $_, 1 for reverse @remove;
  print join(' ', @data), "\n";
}

It even avoids recomputing the indices on every line, whereas my version performs a modulus operation for every single cell. (On the other hand, if one is optimizing that step, one might as well move the reverse @remove into the body of unless (@remove), so that a large array is not reversed again on every line.)

I thought I should check whether my hunch that my recommendation would run faster was actually correct.

First, I created a (relatively) small, 100-line data file using the following script:

use feature 'say';

my $row = join ' ', 'A' .. 'ZZZZ';
say $row for 1 .. 100;

which gave me:

-rw-r--r-- 1 sinan sinan 225M Aug 21 20:44 input.data

Then it was time to test. I didn't reboot between tests or anything; all I did was first run my version, creatively named xx.pl, twice:

$ time ./xx.pl input.data > /dev/null 
real 0m39.375s
user 0m39.120s
sys 0m0.207s
$ time ./xx.pl input.data > /dev/null 
real 0m39.139s
user 0m38.647s
sys 0m0.227s

Presumably, the file is already in the cache in its entirety by now.

Then I ran the code from the other (initially accepted) answer:

$ time ./yy.pl input.data > /dev/null 
^C
real 16m38.787s
user 16m31.767s
sys 0m0.767s

There are two possibilities: either something is really wrong with my code, such that it doesn't actually do much work before terminating, or splitting a long line of 2,357,263 characters into 475,254 fields and then splicing the resulting array about 158 thousand times per line is genuinely slow. For the 100-line test file, that works out to a total of 15.8 million splices on an array of hundreds of thousands of fields. No wonder I had to hit CTRL-C after 16 minutes.

1 comment:

  1. I don't think accumulating the rows and printing in 1,000-line chunks saves anything. Since the OS (in my case, Linux) provides efficient output buffering anyway, it is a non-factor.

    The splice is the killer here. Here are the timings from my testing of your solution; mine is a slower computer.

    >> time perl unur.pl < input.data > /dev/null
    real 0m55.511s
    user 0m54.927s
    sys 0m0.516s

    >> time perl unur.pl < input.data > /dev/null
    real 0m53.486s
    user 0m52.895s
    sys 0m0.488s

    Printing each line without accumulating:

    >> time perl unur2.pl < input.data > /dev/null
    real 0m53.568s
    user 0m52.963s
    sys 0m0.496s

    >> time perl unur2.pl < input.data > /dev/null
    real 0m52.552s
    user 0m52.163s
    sys 0m0.332s

    Now, here's my take. I retained the split from the other solution and replaced the splice with another approach. Also, there is no accumulating of the output rows:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @remove;
    while (<>) {
        my @data = split;

        unless (@remove) {
            @remove = (1, 0, 1, 0, 0);
            for (my $i = 5; $i < @data; $i += 3) {
                push @remove, 0, 0, 1;
            }
        }

        my $i = 0;
        print join(' ', grep {!$remove[$i++]} @data), "\n";
    }

    >> time perl prakash.pl < input.data > /dev/null
    real 0m28.866s
    user 0m28.510s
    sys 0m0.304s

    >> time perl prakash.pl < input.data > /dev/null
    real 0m28.719s
    user 0m28.450s
    sys 0m0.224s

    I didn't expect it to be faster than yours, but surprisingly it was. I'd chalk that up to split being faster than using a regex to pull out the input fields.
