Read select lines from a file using Perl

In this Stackoverflow question, the poster wants to know how to read only those lines whose immediate predecessor and successor lines satisfy certain conditions. For simplicity, the OP specifies that only lines that are not immediately preceded and followed by lines matching /foo/ should be printed, but why not generalize, right?

I posted an answer. Let’s try to generalize that. We basically want to write a function that accepts three parameters:

  1. An iterator that returns the next line from a stream.
  2. A test that takes the preceding and following lines as arguments and tells us if the middle line passes.
  3. A function to call with the middle line if the surrounding lines pass.

First, let’s write a function that turns a filehandle $fh into an iterator that starts from the current position of the file and returns the result of <$fh> with each call:

sub make_stream_iterator {
    my $fh = shift;
    my $pos = tell $fh;

    return sub {
        seek $fh, $pos, 0;
        my $line = <$fh>;
        $pos = tell $fh;
        return $line;
    };
}

On the other hand, if we have an array of lines already, we can do:

my $it = sub {
    my ($x) = @_;
    my $n = @$x;
    my $i = 0;
    return sub { ($i < $n) ? $x->[$i++] : undef };
}->(\@somearray);

Now, we need to keep an extra two lines in memory to decide whether a given line should pass our criteria. There is no reason to use any more memory. So, we’ll use a two line buffer. Here is a verbosely named function that creates the required iterator.

sub make_conditionally_pass_mid_iterator {
    state $counter = 0;
    my %args = (
        callback => sub {
            my $id = ++$counter;
            sub { print join("\t", "it$id:", @_) }
        }->(),
        input => sub { scalar <> },
        test => sub { 1 },
        @_,
    );

    my (
        $callback,
        $input,
        $test,
    ) = @args{qw(callback input test)};

    my @buffer = (undef, sub { $input->() });

    return sub {
        return unless defined(my $candidate = $buffer[1]->());

        my $next = $input->();
        $callback->($candidate) if $test->($buffer[0], $next);

        @buffer[0,1] = ($candidate, sub { $next });
        return 1;
    };
}

Allrighty then! There seems to be a bit of functional overload going on here. First, let’s take a look at that default callback. The state variable $counter enables us to give each default callback an id number. In the first sub, we increment and copy its current value to a variable over which the subsequent sub closes. And, that’s the sub we assign as the default callback.

The default input iterator is sub { scalar <> } which simply reads the next line from ARGV.

The default is for all lines to pass the test.

The next interesting thing is how we initialize @buffer:

my @buffer = (undef, sub { $input->() });

Here, we don’t simply read a line from the file. First, we set the line before the first to undef: There is no line before the first line. Next, we assign as the first line a promise to read a line from the input stream. This way, the mere act of creating an iterator does not change the file position. That might matter if you are going to use, say, the same filehandle multiple times in the same script (as my example below will do with DATA).

Next comes the definition of the actual iterator. Note that stream will already have been exhausted if the candidate line is undefined, so we check for that and bail out if that is the case. Next, we read the line after the candidate line. If both the preceding line and the following line pass their respective tests, we pass the candidate line to the callback.

Before we return from the iterator, we make the current candidate line the previous line, and wrap the current following line in a promise for the next invocation.

Note that I have chosen to relegate the handling of undefined previous and next lines at the beginning and end, respectively, of input to the test function to keep the iterator itself uncluttered. Besides, it might be more appropriate to treat those non-existent lines differently under different circumstances.

So, here are a couple of example iterators:

  • make_conditionally_pass_mid_iterator( input => make_stream_iterator(*DATA));

    This is a particularly inefficient way of printing every line in the script’s __DATA__ section.

  • make_conditionally_pass_mid_iterator( test => sub { my $prev = shift; my next = shift; return!((defined(prev) && $prev =~ /\Afoo/) && (defined($next) && $next =~ /\Aoof/) ); }, ),

    This one prints every line that is not immediately preceded and followed by lines that start with 'foo' and 'oof', respectively.

Here is a self contained test script:

#!/usr/bin/env perl

use 5.014; # not absolutely necessary, but c'mon ;-)
use strict;
use warnings;
use Text::Diff;

my @expected = (
    join("\n", qw(
        1
        2
        3
        foo
        a
        b
        c
        oof
        1
        foo
        9
        oof
        x
        y
        z
        1
        eof
        foo
    ), ''),
    join("\n", qw(
        1
        2
        3
        foo
        foo
        oof
        eof
    ), ''),
    join("\n", qw(
        1
        2
        3
        foo
        a
        b
        c
        oof
        1
        foo
        oof
        x
        y
        z
        1
        eof
        foo
    ), ''),
);

my @results;

my @iterators = (
    make_conditionally_pass_mid_iterator(
        callback => sub { $results[0] .= $_[0] },
        input => make_stream_iterator(\*DATA),
    ),
    make_conditionally_pass_mid_iterator(
        callback => sub { $results[1] .= $_[0] },
        input => make_stream_iterator(\*DATA),
        test => sub {
            my $prev = shift;
            return 1 unless defined $prev;
            return $prev =~ /\A[0-9]/;
        }
    ),
    make_conditionally_pass_mid_iterator(
        callback => sub { $results[2] .= $_[0] },
        input => make_stream_iterator(\*DATA),
        test => sub {
            my $prev = shift;
            my $next = shift;
            return !(
                (defined($prev) && $prev =~ /\Afoo/) &&
                (defined($next) && $next =~ /\Aoof/)
            );
        },
    ),
);

1 while (map $_->(), @iterators);

for my $i (0 .. 2) {
    unless ($results[$i] eq $expected[$i]) {
        warn "Iterator [$i] results don't match!\n";
        print diff(\$expected[$i], \$results[$i]);
    }
}

sub make_conditionally_pass_mid_iterator {
    state $counter = 0;
    my %args = (
        callback => sub {
            my $id = ++$counter;
            sub { print join("\t", "it$id:", @_) }
        }->(),
        input => sub { scalar <> },
        test => sub { 1 },
        @_,
    );

    my (
        $callback,
        $input,
        $test,
    ) = @args{qw(callback input test)};

    my @buffer = (undef, sub { $input->() });

    return sub {
        return unless defined(my $candidate = $buffer[1]->());

        my $next = $input->();
        $callback->($candidate) if $test->($buffer[0], $next);

        @buffer[0,1] = ($candidate, sub { $next });
        return 1;
    };
}

sub make_stream_iterator {
    my $fh = shift;
    my $pos = tell $fh;

    return sub {
        seek $fh, $pos, 0;
        my $line = <$fh>;
        $pos = tell $fh;
        return $line;
    };
}

__DATA__
1
2
3
foo
a
b
c
oof
1
foo
9
oof
x
y
z
1
eof
foo

Running the script above should produce no output.

I really want to learn OCaml!

In the mean time, read Higher Order Perl. It is freely available, but I bought my hardcopy a long time ago and love it.