Monday, October 4, 2010

Sometimes, a brute force solution is acceptable

Here is an interesting problem from Stackoverflow. Basically, the OP has a bunch of strings which I will refer to as identifiers. These strings consist of a bunch of English words and the occasional abbreviation, such as desc, run together as in bankaccountstatusdesc. The task is to turn that string into bank account status desc.

I do not know how to write a 100% accurate solution for this problem. I don't know if it is even possible to do that. However, it took me about 10 minutes to come up with a possibly good enough solution and with another half an hour to refine it a little.

The inspiration came from James Hague's wonderful essay A Spellchecker Used to Be a Major Feat of Software Engineering. As someone who had to fight with WordStar's .ovl files and who once wrote memory bank switching code on a ZX Spectrum 128, I really appreciated that essay.

The idea is simple: Pretty much evey *nix out there comes with a /usr/share/dict file consisting of English words. First, create a pattern consisting of all of those words (ignoring anything shorter than three characters, sorting by length of word and alphabetically). Then, match each identifier progressively starting from the end of the identifier against this pattern, removing matched portions. If, at the end of this, there is anything remaining in the original string, flag that particular identifier for human attention.

In the refined version below, I allowed for the dictionary file, the list of identifiers, and the output to be specified as command line arguments. Only the dictionary file is mandatory. One can optionally specify another set of strings to be considered as "words" by specifying the command line option --mydict. For the OP's case, one has to specify a file containing the string desc and the misspelled conrol as well as any other abbreviations.

Here is the code (caution: I went a little iterator-crazy with this one!)

#!/usr/bin/perl
# Copyright (C) 2010 A. Sinan Unur
# License http://search.cpan.org/perldoc/perlartistic
# THIS SCRIPT IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS 
# OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, 
# THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS 
# FOR A PARTICULAR PURPOSE

use strict; use warnings;
use File::Slurp;
use Getopt::Long;

my ($opt_list, $opt_dict, $opt_mydict, $opt_out);
GetOptions(
    "list=s"   => \$opt_list,
    "dict=s"   => \$opt_dict,
    "mydict=s" => \$opt_mydict,
    "out=s"    => \$opt_out,
) or die sprintf(
    "%s %s [ %s ] [ %s ] [ %s ]\n",
    $0,
    "--dict='dictionary file'",
    "--mydict='custom dictionary file'",
    "--list='list file'",
    "--out='output file'"
);

die "You must specify a dictionary file\n" unless defined $opt_dict;

my $matcher = make_matcher($opt_dict, $opt_mydict);
my $reader  = make_reader($opt_list);
my $writer  = make_writer($opt_out);

while ( defined( my $line = $reader->() ) ) {
    my @identifiers = split ' ', $line;

    for my $identifier ( @identifiers ) {
        my $orig = $identifier;
        my @stack;

        while ( defined (my $word = $matcher->($identifier) ) ) {
            push @stack, $word;
        }

        push @stack, $identifier, '*' if length $identifier;

        $writer->(
            sprintf "%s : %s\n", $orig, join(' ', reverse @stack)
        );
    }
}

sub make_reader {
    my ($list_file) = @_;

    return sub { <> } unless defined $list_file;

    open my $list_h, '<', $list_file
        or die "Cannot open '$list_file' for reading: $!";

    return sub { <$list_h> };
}

sub make_writer {
    my ($output_file) = @_;

    return sub { print @_ } unless defined $output_file;

    return sub { append_file $output_file, \ @_ };
}

sub make_matcher {
    my ($dict_file, $my_dict_file) = @_;

    my $pat = join '|',
        map quotemeta,
        sort { length $b <=> length $a || $a cmp $b }
        grep 2 < length,
        map { chomp; $_ }
        read_file( $dict_file ),
        defined( $my_dict_file ) ? read_file($my_dict_file) : ()
        ;

    my $re = qr/(?<word>$pat)\z/;

    return sub {
        return $+{word} if $_[0] =~ s/$re//;
        return;
    }
}

Now, given a list of identifiers in list.txt:

    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod

and a custom dictionary in mydict.txt, we can invoke the script as:

$ expand_identifiers --dict=/usr/share/dict/words --mydict=mydict.txt \
--list=list.txt --output=expanded.txt

and get the output in the file expanded.txt:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : age conrol age limit
ageconrolnoticeperiod : age conrol notice period

Notice that the suspicious identifier was marked with an asterisk.

One would still need some kind of human check on the output, but presumably, that would require much less effort than manually splitting thousands of such identifiers.

4 comments:

  1. As I mentioned over on Ovid's perl blog - check out

    http://www.perlmonks.org/?node_id=712392

    Assuming you have a dictionary file containing all possible valid words, then it is absolutely possible to write a 100% accurate solution.

    The solution I ended up going with used a trie implementation with a handful of options such as minimum token length as well as excluding solutions with invalid tokens (words not in the dictionary file).

    ReplyDelete
  2. Joshua is incorrect. I do not believe it is possible to write a 100% accurate solution. The problem is that, in practice, the data is not clean. And even if it was, you still have to determine how to split:

    benefitaccountedit

    Benefit Account Edit
    Benefit Accounted It
    Benefit Account Ed It

    Which is correct?

    The following link proposes a solution that attains 95% accuracy (or higher):

    http://www.whitemagicsoftware.com/software/java/wordsplit/

    ReplyDelete
  3. @Anonymous

    All 3 are valid solutions which is why 100% accuracy is possible. If you read the Stackoverflow example, there were multiple possible outcomes just as in your example. Provided you produce ALL possible solutions then you are 100% accurate.

    ReplyDelete