Monday, April 9, 2012

Using \b in regex patterns can lead to surprising results

Consider this pattern-matching question on Stack Overflow.

Basically, given a string like q{Start the function "function name" with (0x10)}, the task is to extract the function name inside the quotation marks and the hexadecimal number in parentheses, but only if the quoted name is preceded by the word function (there is one more condition which I am leaving out for brevity).

One of the answers included a liberal sprinkling of \b assertions in the pattern. If you don't have perldoc perlreref handy, \b means "Match word boundary (between \w and \W)". The start and end of the string count as \W for this purpose.
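To make that definition concrete, here is a small sketch that lists the offsets at which \b matches in a string. The helper name boundary_offsets and the sample string are mine, chosen only for illustration; they are not part of the Stack Overflow question or its answers.

```perl
#!/usr/bin/env perl

use strict; use warnings;

# Hypothetical helper: return every offset at which \b matches in $s.
# Perl does not allow a zero-length match twice at the same position,
# so the loop advances through the string on its own.
sub boundary_offsets {
    my ($s) = @_;
    my @offsets;
    while ($s =~ /\b/g) {
        push @offsets, pos($s);
    }
    return @offsets;
}

# Offsets: 0 = before "the", 3 = after "the",
# 5 = between "-" and "function", 13 = end of string.
print join(',', boundary_offsets('the -function')), "\n";   # prints 0,3,5,13
```

Note the boundary at offset 5: the hyphen is a \W character, so \b is satisfied right before function even though no whitespace separates them.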

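In most cases, that is not what the author of the pattern wants: any non-word character, such as punctuation, creates a word boundary, not just whitespace.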

For example:

#!/usr/bin/env perl

use strict; use warnings;

my @strings = (
    q{start the function "function name" with (0x10)},
    q{start the -function "function name" with (0x10)},
);

my %pat = (
    space => qr/[ ] function [ ] "( [^"]+ )"/x,
    boundary => qr/\bfunction\b.*"([^"]+)"/,
);

# Sort the keys so the output order is the same on every run.
for my $k (sort keys %pat) {
    print "Pattern = [$k]\n";

    for my $s (@strings) {
        printf(
            "%s: %s\n",
            $s =~ $pat{$k} ? 'Matched' : 'Did not match',
            $s
        );
    }
}

This script outputs:

Pattern = [boundary]
Matched: start the function "function name" with (0x10)
Matched: start the -function "function name" with (0x10)
Pattern = [space]
Matched: start the function "function name" with (0x10)
Did not match: start the -function "function name" with (0x10)

I do not think the author intended for the pattern to match when the name of the function was preceded by -function rather than an unadorned function.
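If spelling out literal spaces is too strict (for instance, because the word may sit at the very start of the string), one possible alternative is a pair of negative lookarounds on \S. This is only a sketch of a technique I am swapping in here, not something taken from the answer in question; the names $strict and extract_name are hypothetical.

```perl
#!/usr/bin/env perl

use strict; use warnings;

# Assumption: "function" should count only when it is delimited by
# whitespace or by the edges of the string, not by punctuation.
my $strict = qr/(?<!\S)function(?!\S).*"([^"]+)"/;

# Hypothetical helper: return the quoted name, or undef if no match.
sub extract_name {
    my ($s) = @_;
    return $s =~ $strict ? $1 : undef;
}

for my $s (
    q{start the function "function name" with (0x10)},
    q{start the -function "function name" with (0x10)},
) {
    my $name = extract_name($s);
    printf "%s: %s\n", defined $name ? "Matched [$name]" : 'Did not match', $s;
}
```

With this pattern the first string yields function name, while the second is rejected because function is preceded by a hyphen (and the other occurrence is preceded by a quotation mark).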

This is not intended to be a criticism of the OP: I just saw a good opportunity to point out how \b might be a source of confusion. I tend to avoid it.
