Tuesday, December 7, 2010

Don't blame Perl for all ugly code!

It seems every few days someone on Stack Overflow has to post a comment claiming that Perl is write-only or that it looks like line noise. Heck, there is even a question devoted to the myth.

A poster by the name of blunders asked where to find side-by-side comparisons of Perl, Python, and Ruby code solving the same problem. The question referenced a blog post by William Jiang.

Looking at William's Perl code, it is clear to me where Perl's reputation comes from: the low barrier to entry for writing functioning Perl programs allows people to pontificate without understanding the language.

William's task is to take a tab-delimited file, extract certain fields from it, and output the records that meet a certain criterion in a certain order. In his words:

  1. Read each line of standard input and break it into fields at each tab.
  2. Each field is wrapped in quotation marks, so remove them. Assume that there are no quotation marks in the interior of the field.
  3. Store the fields in an array called record.
  4. Create another array, records and fill it with all the records.
  5. Make a new array, contactRecords, that contains arrays of just the fields we care about: SKUTITLE, CONTACTME, EMAIL.
  6. Sort contactRecords by SKUTITLE.
  7. Remove the elements of contactRecords where CONTACTME is not 1.
  8. Print contactRecords to standard output, with the fields separated by tabs and the records separated by newlines.

For reference, here is his solution:

#!/usr/bin/perl -w

use strict;

my @records = ();

foreach my $line ( <> )
{
    my @record = map {s/"//g; $_} split("\t", $line);
    push(@records, \@record);
}

my $EMAIL = 17;
my $CONTACTME = 27;
my $SKUTITLE = 34;

my @contactRecords = ();
foreach my $r ( @records )
{
    push(@contactRecords, [$$r[$SKUTITLE],
          $$r[$CONTACTME], $$r[$EMAIL]]);
}

@contactRecords = sort {$$a[0] cmp $$b[0]} @contactRecords;
@contactRecords = grep($$_[1] eq "1", @contactRecords);

foreach my $r ( @contactRecords )
{
    print join("\t", @$r), "\n";
}

and his conclusion is: "The punctuation and my's make this harder to read than it should be."

Let's see if we can re-write this in Perl.

First, note that William slurps the input by using foreach my $line ( <> ), which reads the entire file into memory before iterating over the resulting list of lines. This gratuitous slurping makes the memory footprint of the program proportional to the total size of the input rather than to the length of the longest line, and it would cause real trouble with larger or indefinite input sizes.
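
To make the difference concrete, here is a minimal sketch of my own (not part of either script; the loop body is just a placeholder) contrasting the two reading styles:

#!/usr/bin/perl

use strict; use warnings;

# Slurping, as in the original: <> in list context reads every line of the
# input into memory before the loop body runs even once.
#
#     for my $line ( <> ) { ... }

# Streaming: <> in scalar context yields one line per iteration, so memory
# use is bounded by the longest line rather than by the whole input.
while ( my $line = <> ) {
    chomp $line;
    print length($line), "\n";    # stand-in for the real per-line work
}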

He then goes ahead and creates another array containing a subset of the elements of each line, and filters that, further increasing the memory footprint.

The program is harder to read than necessary because of the extra variables, the avoidance of hashes where they are appropriate, and the use of the clunkier array-dereferencing syntax $$r[$EMAIL] instead of the standard arrow notation $r->[$EMAIL]. But really, use hash tables, Luke!
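
To see the difference, compare the following forms (a throwaway example of mine, not taken from either script):

#!/usr/bin/perl

use strict; use warnings;

# A made-up record in the same shape William uses: an array reference.
my $r = [ 'someone@example.com', 1, 'Some SKU title' ];

print $$r[0], "\n";      # the dereferencing syntax used in the original
print ${$r}[0], "\n";    # the same thing, with explicit braces
print $r->[0], "\n";     # arrow notation: reads left to right

# With a hash-based record, the field names document themselves and no
# index variables are needed at all.
my %record = ( email => 'someone@example.com', contactme => 1, skutitle => 'Some SKU title' );
print $record{skutitle}, "\n";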

Here is a re-write of William's script based on my understanding of the specs:

#!/usr/bin/perl

use strict; use warnings;
use Text::xSV;

my %fields = (
    email => 17,
    contactme => 27,
    skutitle => 34,
);

my $tsv = Text::xSV->new(sep => "\t");

my @to_contact;

while ( my $row = $tsv->get_row ) {
    my %record;
    @record{ keys %fields } = (@$row)[values %fields];

    $record{contactme}
        and push @to_contact, \%record;
}

@to_contact = sort {
    $a->{skutitle} cmp $b->{skutitle}
} @to_contact;

for my $record ( @to_contact ) {
    print join("\t", @{ $record }{ qw(skutitle contactme email) }), "\n";
}

First, note the use of Text::xSV. Using this module removes the need to do any custom processing of the fields read and makes the script more adaptable if the input format changes. In fact, if I knew the exact fields in the file, or if the file contained a header row, the %fields hash and the hash slices below would be completely unnecessary.
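
For illustration, here is a sketch of what that could look like if the file did have a header row naming the columns. The column names skutitle, contactme and email are my assumption, and the method names come from my reading of the Text::xSV documentation rather than from anything above, so verify them against the version you have:

#!/usr/bin/perl

use strict; use warnings;
use Text::xSV;

my $tsv = Text::xSV->new( fh => \*STDIN, sep => "\t" );
$tsv->read_header;    # bind column names from the first row

my @to_contact;

while ( $tsv->get_row ) {
    my %record;
    @record{ qw(skutitle contactme email) }
        = $tsv->extract( qw(skutitle contactme email) );

    $record{contactme}
        and push @to_contact, \%record;
}

print join("\t", @{ $_ }{ qw(skutitle contactme email) }), "\n"
    for sort { $a->{skutitle} cmp $b->{skutitle} } @to_contact;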

One can leverage CPAN a little more, though. The DBD::CSV module allows the programmer to use DBI to access and extract information from a TSV file.

If the input file has the column names in the first line, the task can be accomplished by simply using the following script:

#!/usr/bin/perl

use strict; use warnings;

use DBI;

my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir => '.',
    f_ext => '.txt',
    csv_eol => $/,
    csv_sep_char => "\t",
    FetchHashKeyName => "NAME_lc",
    RaiseError => 1,
    PrintError => 1,
}) or die $DBI::errstr;

my $sth = $dbh->prepare(q{
    SELECT skutitle, email
    FROM clients
    WHERE contactme = 1
    ORDER BY skutitle
});

$sth->execute;

while ( my $row = $sth->fetchrow_arrayref ) {
    print join("\t", @$row), "\n";
}

Update: Focusing on Text::xSV misses the point

A lot of commenters seem to be hung up on my choice of using Text::xSV to parse the data. Here, then, is the version without using any external modules:

#!/usr/bin/perl

use strict; use warnings;

my %fields = (
    email => 17,
    contactme => 27,
    skutitle => 34,
);

my @to_contact;

while ( my $line = <> ) {
    chomp $line;
    my @row = split /\t/, $line;

    my %record;
    @record{ keys %fields } = (@row)[values %fields];

    for my $v ( values %record ) {
        $v =~ s/^"//;
        $v =~ s/"\z//;
    }

    $record{contactme}
        and push @to_contact, \%record;
}

@to_contact = sort {
    $a->{skutitle} cmp $b->{skutitle}
} @to_contact;

for my $record ( @to_contact ) {
    print join("\t", @{ $record }{ qw(skutitle contactme email) }), "\n";
}

And, of course, if you want to reinforce the notion that Perl is line noise:

#!/usr/bin/perl -n
@r=map{s/"//g;$_}(split/\t/)[34,27,17];
print join("\t",@r[0,2])."\n" if $r[1];

or

#!/usr/bin/perl -naF\t
@F=map{s/"//g;$_}@F[34,27,17];$F[1
]and print join("\t",@F[0,2])."\n"
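
For anyone squinting at the switches: -n wraps the body in a while (<>) read loop, and -a with -F adds an automatic split into @F. Spelled out, the first one-liner is roughly equivalent to the sketch below (which, like the one-liners, neither chomps nor sorts):

#!/usr/bin/perl

use strict; use warnings;

while ( my $line = <> ) {
    # take columns 34, 27 and 17 (skutitle, contactme, email) and strip quotes
    my @r = map { s/"//g; $_ } ( split /\t/, $line )[ 34, 27, 17 ];

    # print skutitle and email if contactme is true
    print join( "\t", @r[ 0, 2 ] ), "\n" if $r[1];
}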

19 comments:

  1. why not just use perl ?

    use Modern::Perl;

    my @fields =
    ( [ email => 17 ]
    , [ contactme => 27 ]
    , [ skutitle => 34 ]
    );

    my @field_names = map { $$_[0] } @fields;
    my @field_numbers = map { $$_[1] } @fields;

    say join "\t", @$_{ @field_names }
        for sort { $$a{skutitle} cmp $$b{skutitle} }
            map {
                chomp;
                my %member;
                @member{ @field_names }
                    = (split /\t/)[ @field_numbers ];
                $member{contactme} ? \%member : ()
            } <>;

    eiro@github

  2. in the first code
    push(@contactRecords, [$$r[$SKUTITLE], $$r[$CONTACTME], $$r[$EMAIL]]);

    could be written as

    push @contactRecords, [ @$r[$SKUTITLE, $CONTACTME, $EMAIL] ];

  3. Hardly a fair rewrite. The Python and Ruby code only uses the standard library, as did the original Perl implementation, yet your rewrite in Perl makes use of CPAN modules.

    If you want it to be a fair comparison you should do your rewrite whilst only making use of the Perl standard library.

    You could make the Python code much nicer if you were allowed to use any third party Python code too.

  4. Missing from this is completing the rewrite by clearly rewriting the CSV parsing without punting to CPAN. I'd use three parallel arrays to store the three fields we care about -- again optimizing for memory footprint -- and then do the sort step by sorting an array of index numbers. Still too clever clever to be clear, though.

    A DBD::CSV solution might be the most maintainable, but it certainly isn't much fun.

  5. @niz: I am not interested in language comparisons. As a Perl programmer, CPAN is at my disposal. It might not be fair that Perl programmers have access to such a resource, but we do.

    @david: I did briefly entertain the thought of re-writing the CSV parsing, but "punting" to CPAN results in a much more robust implementation in the face of what passes as TSV files out there.

    The main points are: 1) Do not slurp 2) Eliminate irrelevant records as soon as possible and 3) Make output order of columns easy to change and independent of input column definitions.

  6. Niz,

    Is there some reasonably stable and available set of stuff for Python that you think would help solve this problem? If so, I'd be interested in seeing it and learning how hard it is to install it. For example, with cpan I'd just do "cpanm XXX" and that would be it, or better yet, I'd write a Makefile for my application listing those deps and then let the code manage itself. How would a Pythonist approach that?

  7. I think Perl's symbols actually make code easier to read than, say, Python or Java, because it's less regular - hence, less boring.

    Besides, symbols such as "@" pack meaning. They pack abstractions.

    This makes Perl a much more visually-oriented language than others. Or perhaps I should say "thought-oriented". It's sort of like Japanese versus English. A Japanese reader would argue that the Japanese characters pack more meaning, and can be better traced to their original ideas, whereas English is just sound.

    So there's a lot more in Perl than "bad design" (in fact, IMHO, it's very good design). Perl somehow seems to stick in my mind. Python or Java is more like having to learn by rote.

  8. awk 'BEGIN {FS="\t"} {print $17 "\t" $27 "\t" $34}'

    Perl still sucks.

  9. The claim in this case is not 'pure perl using only core libraries is hard to read', it is 'actual perl in production environments is hard to read' compared to other languages.

    So of course it's fair to write it like a real professional programmer really would.

    I think it's absolutely right that perl has a reputation of being unmaintainable primarily because, in the mid to late 90s, a lot of brand new developers wrote their first real professional code in perl. And then had their first real experiences with maintaining it. There exists a lot of crappy legacy perl. But the stuff that good programmers write is every bit as readable as what good programmers in any other language write.

    It's kind of like the complaint about java that it's got a culture of cargo cult perma-junior programmers who don't know much more than how to copy and paste patterns from somewhere else. And that may have some validity - but it has nothing to do with java as a language, and a lot to do with hiring patterns and corporate biases during the dot com bubble.

  10. all three of the languages have robust third party libraries which are easy to grab. Niz is right, how does this rebut the claims if you went outside of the implied parameters of the exercise to do it? Leveraging CPAN isn't a fair judge of the language as it isn't a language feature--it's a community feature.

  11. This is the problem with Perl, sorry.

    You try to beautify the code, and it gets better.

    But compare the same solution in Ruby or Python, and it just *is* cleaner in both languages.

  12. Seriously, what is the point of splitting TSV rows by hand and cleaning all quotation marks from a string? Using Text::xSV enables me to parse those lines robustly, but it is not essential if one wants to maintain misfeature-by-misfeature compatibility with the original.

    I am amused by alternative solutions (such as the one in awk) which fail to strip the quotation marks enclosing the fields, which was actually part of the original author's specification.

    You should also keep in mind that I am not into "your language sucks" contests. Rather, if you are going to write Perl, write it like you mean it.

  13. @shevegen ... and slurping files willy-nilly is just as bad in all languages.

  14. @Sinan The original article was obviously trying to demonstrate what Perl looked like without using CPAN modules so that you could see what your Perl code would look like when you came up against a problem that does not have a prewritten CPAN module available for it.

    That is why your example is unfair. Because it completely misses the point of the original article. You won't always have CPAN to fall back on. Eventually you will need to write nasty Perl code yourself when there is no third party library available and that is what the original article was trying to demonstrate.

  15. @Niz: I have since added both a version that does not use Text::xSV and a line-noise version (see the end of the post).

    There is very little difference between the version that uses Text::xSV and the one that does not in terms of lines of code. The main difference is in robustness.

    The point of my post was to illustrate better Perl, not claim that Perl is better than Python or Ruby.

    However, consider the following

    record = $_.split('\t').collect! {|field| field.gsub('"', '') }

    and

    contactRecords=records.collect {|r| [r[SKUTITLE], r[CONTACTME], r[EMAIL]] }

    If you do not already know Ruby, do you automatically know why one collect has a ! and the other does not?

  16. People forget that Perl actually did encourage people to write code that we now consider poor practice. For instance, see my post about Let perl tell you how to code.

  17. IMO the first version was easiest to read without reading the explanation first. I disagree that something can be easier or cleaner when the maintainer has to learn a new module. Quite frankly, I thought the golf programs were easier to read than all four. Note they don't sort the resulting array.

  18. use 5.018;

    use constant EMAIL => 17;
    use constant CONTACTME => 27;
    use constant SKUTITLE => 34;

    my @records = ();

    for my $line ( STDIN->getlines ) {
        $line =~ s/"//g; # bad idea but corresponds to the original script
        my @record = split( /\t/, $line );
        push @records, \@record;
    }

    for my $r ( sort { $a->[SKUTITLE] cmp $b->[SKUTITLE] } @records ) {
        say $r->[SKUTITLE], "\t", $r->[CONTACTME], "\t", $r->[EMAIL]
            if $r->[CONTACTME] == 1;
    }

    Replies
    1. Nice. If you are used to perl you can drop $line since $_ is the target of the for loop:

      for ( STDIN->getlines ) {
          s/"//g;
          my @record = split /\t/;
          push @records, \@record;
      }

      Sorting records for output *after* eliminating those where CONTACTME != 1 would be faster ... do we want grep() here?? What about map? ;-)

      Would it be easier to sort/grep the array of arrayrefs if it was a hash? my %records = {}; ... just a question; I see the "spec" requested arrays.
