Tuesday, December 7, 2010

Don't blame Perl for all ugly code!

It seems every few days someone on Stack Overflow has to post a comment claiming that Perl is write-only or that it looks like line noise. Heck, there is even a question devoted to the myth.

A poster by the name of blunders asked where to find side-by-side comparisons of Perl, Python, and Ruby code solving the same problem. The question referenced a blog post by William Jiang.

Looking at William's Perl code, it is clear to me where Perl's reputation comes from: the low barrier to entry for writing functioning Perl programs allows people to pontificate without understanding the language.

William's task is to take a tab-delimited file, extract certain fields from it, and output the records that meet a certain criterion in a certain order. In his words:

  1. Read each line of standard input and break it into fields at each tab.
  2. Each field is wrapped in quotation marks, so remove them. Assume that there are no quotation marks in the interior of the field.
  3. Store the fields in an array called record.
  4. Create another array, records and fill it with all the records.
  5. Make a new array, contactRecords, that contains arrays of just the fields we care about: SKUTITLE, CONTACTME, EMAIL.
  6. Sort contactRecords by SKUTITLE.
  7. Remove the elements of contactRecords where CONTACTME is not 1.
  8. Print contactRecords to standard output, with the fields separated by tabs and the records separated by newlines.

For reference, here is his solution:

#!/usr/bin/perl -w

use strict;

my @records = ();

foreach my $line ( <> )
{
    my @record = map {s/"//g; $_} split("\t", $line);
    push(@records, \@record);
}

my $EMAIL = 17;
my $CONTACTME = 27;
my $SKUTITLE = 34;

my @contactRecords = ();
foreach my $r ( @records )
{
    push(@contactRecords, [$$r[$SKUTITLE],
          $$r[$CONTACTME], $$r[$EMAIL]]);
}

@contactRecords = sort {$$a[0] cmp $$b[0]} @contactRecords;
@contactRecords = grep($$_[1] eq "1", @contactRecords);

foreach my $r ( @contactRecords )
{
    print join("\t", @$r), "\n";
}

and his conclusion is: "The punctuation and my's make this harder to read than it should be."

Let's see if we can re-write this in Perl.

First, note that William slurps the input by using foreach my $line ( <> ), which reads the entire file into memory before iterating over the resulting list of lines. This gratuitous slurping makes the memory footprint of the program proportional to the total size of the input rather than to the length of the longest line, and it would cause real trouble with larger or indefinite input sizes.
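
To make the difference concrete, here is a minimal sketch of my own (not part of either script; the loop body is just a placeholder) contrasting the two reading styles:

#!/usr/bin/perl

use strict; use warnings;

# Slurping, as in the original: <> in list context reads every line of the
# input into memory before the loop body runs even once.
#
#     for my $line ( <> ) { ... }

# Streaming: <> in scalar context yields one line per iteration, so memory
# use is bounded by the longest line rather than by the whole input.
while ( my $line = <> ) {
    chomp $line;
    print length($line), "\n";    # stand-in for the real per-line work
}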

He then goes ahead and creates another array containing a subset of the elements of each line, and filters that, further increasing the memory footprint.

The program is harder to read than necessary because of the extra variables, the avoidance of hashes where they are appropriate, and the use of the clunkier array-dereferencing syntax $$r[$EMAIL] instead of the standard arrow notation $r->[$EMAIL]. But really, use hash tables, Luke!
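
To see the difference, compare the following forms (a throwaway example of mine, not taken from either script):

#!/usr/bin/perl

use strict; use warnings;

# A made-up record in the same shape William uses: an array reference.
my $r = [ 'someone@example.com', 1, 'Some SKU title' ];

print $$r[0], "\n";      # the dereferencing syntax used in the original
print ${$r}[0], "\n";    # the same thing, with explicit braces
print $r->[0], "\n";     # arrow notation: reads left to right

# With a hash-based record, the field names document themselves and no
# index variables are needed at all.
my %record = ( email => 'someone@example.com', contactme => 1, skutitle => 'Some SKU title' );
print $record{skutitle}, "\n";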

Here is a re-write of William's script based on my understanding of the specs:

#!/usr/bin/perl

use strict; use warnings;
use Text::xSV;

my %fields = (
    email => 17,
    contactme => 27,
    skutitle => 34,
);

my $tsv = Text::xSV->new(sep => "\t");

my @to_contact;

while ( my $row = $tsv->get_row ) {
    my %record;
    @record{ keys %fields } = (@$row)[values %fields];

    $record{contactme}
        and push @to_contact, \%record;
}

@to_contact = sort {
    $a->{skutitle} cmp $b->{skutitle}
} @to_contact;

for my $record ( @to_contact ) {
    print join("\t", @{ $record }{ qw(skutitle contactme email) }), "\n";
}

First, note the use of Text::xSV. Using this module removes the need to do any custom processing of the fields read and makes the script more adaptable if the input format changes. In fact, if I knew the exact fields in the file, or if the file contained a header row, the %fields hash and the hash slices below would be completely unnecessary.
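
For illustration, here is a sketch of what that could look like if the file did have a header row naming the columns. The column names skutitle, contactme and email are my assumption, and the method names come from my reading of the Text::xSV documentation rather than from anything above, so verify them against the version you have:

#!/usr/bin/perl

use strict; use warnings;
use Text::xSV;

my $tsv = Text::xSV->new( fh => \*STDIN, sep => "\t" );
$tsv->read_header;    # bind column names from the first row

my @to_contact;

while ( $tsv->get_row ) {
    my %record;
    @record{ qw(skutitle contactme email) }
        = $tsv->extract( qw(skutitle contactme email) );

    $record{contactme}
        and push @to_contact, \%record;
}

print join("\t", @{ $_ }{ qw(skutitle contactme email) }), "\n"
    for sort { $a->{skutitle} cmp $b->{skutitle} } @to_contact;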

One can leverage CPAN a little more, though. The DBD::CSV module allows the programmer to use DBI to access and extract information from a TSV file.

If the input file has the column names in the first line, the task can be accomplished by simply using the following script:

#!/usr/bin/perl

use strict; use warnings;

use DBI;

my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir => '.',
    f_ext => '.txt',
    csv_eol => $/,
    csv_sep_char => "\t",
    FetchHashKeyName => "NAME_lc",
    RaiseError => 1,
    PrintError => 1,
}) or die $DBI::errstr;

my $sth = $dbh->prepare(q{
    SELECT skutitle, email
    FROM clients
    WHERE contactme = 1
    ORDER BY skutitle
});

$sth->execute;

while ( my $row = $sth->fetchrow_arrayref ) {
    print join("\t", @$row), "\n";
}

Update: Focusing on Text::xSV misses the point

A lot of commenters seem to be hung up on my choice of using Text::xSV to parse the data. Here, then, is the version without using any external modules:

#!/usr/bin/perl

use strict; use warnings;

my %fields = (
    email => 17,
    contactme => 27,
    skutitle => 34,
);

my @to_contact;

while ( my $line = <> ) {
    chomp $line;
    my @row = split /\t/, $line;

    my %record;
    @record{ keys %fields } = (@row)[values %fields];

    for my $v ( values %record ) {
        $v =~ s/^"//;
        $v =~ s/"\z//;
    }

    $record{contactme}
        and push @to_contact, \%record;
}

@to_contact = sort {
    $a->{skutitle} cmp $b->{skutitle}
} @to_contact;

for my $record ( @to_contact ) {
    print join("\t", @{ $record }{ qw(skutitle contactme email) }), "\n";
}

And, of course, if you want to reinforce the notion that Perl is line noise:

#!/usr/bin/perl -n
@r=map{s/"//g;$_}(split/\t/)[34,27,17];
print join("\t",@r[0,2])."\n" if $r[1];

or

#!/usr/bin/perl -naF\t
@F=map{s/"//g;$_}@F[34,27,17];$F[1
]and print join("\t",@F[0,2])."\n"
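
For anyone squinting at the switches: -n wraps the body in a while (<>) read loop, and -a with -F adds an automatic split into @F. Spelled out, the first one-liner is roughly equivalent to the sketch below (which, like the one-liners, neither chomps nor sorts):

#!/usr/bin/perl

use strict; use warnings;

while ( my $line = <> ) {
    # take columns 34, 27 and 17 (skutitle, contactme, email) and strip quotes
    my @r = map { s/"//g; $_ } ( split /\t/, $line )[ 34, 27, 17 ];

    # print skutitle and email if contactme is true
    print join( "\t", @r[ 0, 2 ] ), "\n" if $r[1];
}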

19 comments:

  1. why not just use perl ?

    use Modern::Perl;

    my @fields =
    ( [ email => 17 ]
    , [ contactme => 27 ]
    , [ skutitle => 34 ]
    );

    my @field_names = map { $$_[0] } @fields;
    my @field_numbers = map { $$_[1] } @fields;

    say join "\t", @$_{ @field_names }
        for sort { $$a{skutitle} cmp $$b{skutitle} }
            map {
                chomp;
                my %member;
                @member{ @field_names }
                    = (split /\t/)[ @field_numbers ];
                $member{contactme} ? \%member : ()
            } <>;

    eiro@github

  2. in the first code
    push(@contactRecords, [$$r[$SKUTITLE], $$r[$CONTACTME], $$r[$EMAIL]]);

    could be written as

    push @contactRecords, [ @$r[$SKUTITLE, $CONTACTME, $EMAIL] ];

  3. Hardly a fair rewrite. The Python and Ruby code only uses the standard library, as did the original Perl implementation, yet your rewrite in Perl makes use of CPAN modules.

    If you want it to be a fair comparison you should do your rewrite whilst only making use of the Perl standard library.

    You could make the Python code much nicer if you were allowed to use any third party Python code too.

  4. Missing from this is completing the rewrite by clearly rewriting the CSV parsing without punting to CPAN. I'd use three parallel arrays to store the three fields we care about -- again optimizing for memory footprint -- and then do the sort step by sorting an array of index numbers. Still too clever clever to be clear, though.

    A DBD::CSV solution might be the most maintainable, but it certainly isn't much fun.

  5. @niz: I am not interested in language comparisons. As a Perl programmer, CPAN is at my disposal. It might not be fair that Perl programmers have access to such a resource, but we do.

    @david: I did briefly entertain the thought of re-writing the CSV parsing, but "punting" to CPAN results in a much more robust implementation in the face of what passes as TSV files out there.

    The main points are: 1) Do not slurp 2) Eliminate irrelevant records as soon as possible and 3) Make output order of columns easy to change and independent of input column definitions.

  6. Niz,

    Is there some reasonably stable and available set of stuff for Python that you think would help solve this problem? If so, I'd be interested in seeing it and learning how hard it is to install it. For example, with cpan I'd just do "cpanm XXX" and that would be it, or better yet, I'd write a Makefile for my application listing those deps and then let the code manage itself. How would a Pythonist approach that?

  7. I think Perl's symbols actually make code easier to read than, say, Python or Java, because it's less regular - hence, less boring.

    Besides, symbols such as "@" pack meaning. They pack abstractions.

    This makes Perl a much more visually-oriented language than others. Or perhaps I should say "thought-oriented". It's sort of like Japanese versus English. A Japanese reader would argue that the Japanese characters pack more meaning, and can be better traced to their original ideas, whereas English is just sound.

    So there's a lot more in Perl than "bad design" (in fact, IMHO, it's very good design). Perl somehow seems to stick in my mind. Python or Java is more like having to learn by rote.

  8. awk 'BEGIN {FS="\t"} {print $17 "\t" $27 "\t" $34}'

    Perl still sucks.

  9. The claim in this case is not 'pure perl using only core libraries is hard to read', it is 'actual perl in production environments is hard to read' compared to other languages.

    So of course it's fair to write it like a real professional programmer really would.

    I think it's absolutely right that perl has a reputation of being unmaintainable primarily because, in the mid to late 90s, a lot of brand new developers wrote their first real professional code in perl. And then had their first real experiences with maintaining it. There exists a lot of crappy legacy perl. But the stuff that good programmers write is every bit as readable as what good programmers in any other language write.

    It's kind of like the complaint about java that it's got a culture of cargo cult perma-junior programmers who don't know much more than how to copy and paste patterns from somewhere else. And that may have some validity - but it has nothing to do with java as a language, and a lot to do with hiring patterns and corporate biases during the dot com bubble.

  10. all three of the languages have robust third party libraries which are easy to grab. Niz is right, how does this rebut the claims if you went outside of the implied parameters of the exercise to do it? Leveraging CPAN isn't a fair judge of the language as it isn't a language feature--it's a community feature.

  11. This is the problem with Perl, sorry.

    You try to beautify the code, and it gets better.

    But compare the same solution in Ruby or Python, and it just *is* cleaner in both languages.

  12. Seriously, what is the point of splitting TSV rows by hand and cleaning all quotation marks from a string? Using Text::xSV enables me to parse those lines robustly, but it is not essential if one wants to maintain misfeature-by-misfeature compatibility with the original.

    I am amused by alternative solutions (such as the one in awk) which fail to strip the quotation marks enclosing the fields, which was actually part of the original author's specification.

    You should also keep in mind that I am not into "your language sucks" contests. Rather, if you are going to write Perl, write it like you mean it.

  13. @shevegen ... and slurping files willy-nilly is just as bad in all languages.

  14. @Sinan The original article was obviously trying to demonstrate what Perl looked like without using CPAN modules so that you could see what your Perl code would look like when you came up against a problem that does not have a prewritten CPAN module available for it.

    That is why your example is unfair. Because it completely misses the point of the original article. You won't always have CPAN to fall back on. Eventually you will need to write nasty Perl code yourself when there is no third party library available and that is what the original article was trying to demonstrate.

  15. @Niz: I have since added both a version that does not use Text::xSV and a line-noise version (see the end of the post).

    There is very little difference between the version that uses Text::xSV and the one that does not in terms of lines of code. The main difference is in robustness.

    The point of my post was to illustrate better Perl, not claim that Perl is better than Python or Ruby.

    However, consider the following

    record = $_.split('\t').collect! {|field| field.gsub('"', '') }

    and

    contactRecords=records.collect {|r| [r[SKUTITLE], r[CONTACTME], r[EMAIL]] }

    If you do not already know Ruby, do you automatically know why one collect has a ! and the other does not?

  16. People forget that Perl actually did encourage people to write code that we now consider poor practice. For instance, see my post about Let perl tell you how to code.

  17. IMO the first version was easiest to read without reading the explanation first. I disagree that something can be easier or cleaner when the maintainer has to learn a new module. Quite frankly, I thought the golf programs were easier to read than all four. Note they don't sort the resulting array.

  18. use 5.018;

    use constant EMAIL => 17;
    use constant CONTACTME => 27;
    use constant SKUTITLE => 34;

    my @records = ();

    for my $line ( STDIN->getlines ) {
        $line =~ s/"//g; # bad idea but corresponds to the original script
        my @record = split( /\t/, $line );
        push @records, \@record;
    }

    for my $r ( sort { $a->[SKUTITLE] cmp $b->[SKUTITLE] } @records ) {
        say $r->[SKUTITLE], "\t", $r->[CONTACTME], "\t", $r->[EMAIL]
            if $r->[CONTACTME] == 1;
    }

    Replies
    1. Nice. If you are used to perl you can drop $line since $_ is the target of the for loop:

      for ( STDIN->getlines ) {
          s/"//g;
          my @record = split /\t/;
          push @records, \@record;
      }

      Sorting records for output *after* eliminating those where CONTACTME != 1 would be faster ... do we want grep() here?? What about map? ;-)

      Would it be easier to sort/grep the array of arrayrefs if it was a hash? my %records = {}; ... just a question; I see the "spec" requested arrays.
