Thursday, August 22, 2013

Sometimes, ugly solutions work better

Another day, another Stack Overflow question. For reference, here is the issue:

I have some huge files (30-50 GB each), and they are constructed like this - millions of columns and thousands of rows:

A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10 
A B C D E 1 2 3 4 5 6 7 8 9 10

I would like to delete column "A" and column "C", then every third of the number columns: the "3" column, the "6" column, then the "9" column, and so on until the end of the file. Space delimited.

I have some experience dealing with large files, so I thought I would offer a solution that took into account both the very long lines and the large number of rows the poster had to process. First, splitting long lines is (understandably) slow: lots of substrings have to be copied to lots of memory locations. So, my solution avoids split when there are more than a few hundred columns in each line. Second, producing output is also slow. My solution for that is to accumulate output and print it every thousand lines or so.

So, this is essentially what I recommended:

#!/usr/bin/env perl

use strict;
use warnings;

my $chunk = '';

while (my $line = <>) {
    # grab B, D, and E; leave the numbers unsplit in $rest
    my (undef, $x, undef, $y, $z, $rest) = split ' ', $line, 6;
    my $row = join ' ', $x, $y, $z;
    my $cell = 1;
    # walk the number columns one match at a time, skipping every third
    while ($rest =~ /(\S+)/g) {
        $row .= ' ' . $1 if $cell % 3;
        $cell += 1;
    }
    $chunk .= $row . "\n";
    unless ($. % 1_000) {
        print $chunk;
        $chunk = '';
    }
}

print $chunk;

The OP originally went with the other, more elegant answer (and changed his mind while I was composing this blog post):

my @remove;

while (<>) {
  my @data = split;

  unless (@remove) {
    @remove = (0, 2);    # columns A and C
    for (my $i = 7; $i < @data; $i += 3) {
      push @remove, $i;  # every third number column
    }
  }

  splice @data, $_, 1 for reverse @remove;
  print join(' ', @data), "\n";
}

It even avoids recomputing the indices on every line, whereas my version performs a modulus operation for every single cell. (On the other hand, if one is optimizing that step, one might as well move the reverse @remove into the body of unless (@remove), so that a large array is not reversed again on every line.)

I thought I should check whether my hunch that my recommendation would run faster was actually correct.

First, I created a (relatively) small, 100-line data file using the following script:

use feature 'say';

my $row = join ' ', 'A' .. 'ZZZZ';
say $row for 1 .. 100;

which gave me:

-rw-r--r-- 1 sinan sinan 225M Aug 21 20:44 input.data

Then it was time to test. I didn't reboot between tests or anything; all I did was first run my version, creatively named xx.pl, twice:

$ time ./xx.pl input.data > /dev/null 
real 0m39.375s
user 0m39.120s
sys 0m0.207s
$ time ./xx.pl input.data > /dev/null 
real 0m39.139s
user 0m38.647s
sys 0m0.227s

Presumably, the file is already in the cache in its entirety by now.

Then I ran the code from the other (initially accepted) answer:

$ time ./yy.pl input.data > /dev/null 
^C
real 16m38.787s
user 16m31.767s
sys 0m0.767s

There are two possibilities: either something is really wrong with my code, such that it doesn't actually do much work before terminating, or splitting a long line of 2,357,263 characters into 475,254 fields and then splicing the resulting array about 158 thousand times per line is genuinely slow. For the 100-line test file, that works out to a total of 15.8 million splices on an array of hundreds of thousands of fields. No wonder I had to hit CTRL-C after 16 minutes.

1 comment:

  1. I don't think accumulating the rows and printing in 1,000-line chunks saves anything. Since the OS (in my case, Linux) provides efficient output buffering anyway, it is a non-factor.

    The splice is the killer here. Here are the timings from my testing of your solution; mine is a slower computer.

    >> time perl unur.pl < input.data > /dev/null
    real 0m55.511s
    user 0m54.927s
    sys 0m0.516s

    >> time perl unur.pl < input.data > /dev/null
    real 0m53.486s
    user 0m52.895s
    sys 0m0.488s

    Printing each line without accumulating:

    >> time perl unur2.pl < input.data > /dev/null
    real 0m53.568s
    user 0m52.963s
    sys 0m0.496s

    >> time perl unur2.pl < input.data > /dev/null
    real 0m52.552s
    user 0m52.163s
    sys 0m0.332s

    Now, here's my take. I retained the split from the other solution and replaced the splice with another approach. Also, there is no accumulating of the output rows:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @remove;
    while (<>) {
        my @data = split;

        unless (@remove) {
            @remove = (1, 0, 1, 0, 0);
            for (my $i = 5; $i < @data; $i += 3) {
                push @remove, 0, 0, 1;
            }
        }

        my $i = 0;
        print join(' ', grep {!$remove[$i++]} @data), "\n";
    }

    >> time perl prakash.pl < input.data > /dev/null
    real 0m28.866s
    user 0m28.510s
    sys 0m0.304s

    >> time perl prakash.pl < input.data > /dev/null
    real 0m28.719s
    user 0m28.450s
    sys 0m0.224s

    I didn't expect it to be faster than yours, but surprisingly it was. I'd chalk that up to split being faster than using a regex to pull out the input fields.
