Here is an interesting problem from Stackoverflow. Basically, the OP has a bunch of strings which I will refer to as identifiers. These strings consist of a bunch of English words and the occasional abbreviation, such as desc, run together as in bankaccountstatusdesc. The task is to turn that string into bank account status desc.
I do not know how to write a 100% accurate solution for this problem. I don't know if it is even possible to do that. However, it took me about 10 minutes to come up with a possibly good enough
solution and with another half an hour to refine it a little.
The inspiration came from James Hague's wonderful essay A Spellchecker Used to Be a Major Feat of Software Engineering. As someone who had to fight with WordStar's .ovl files and who once wrote memory bank switching code on a ZX Spectrum 128, I really appreciated that essay.
The idea is simple: Pretty much evey *nix out there comes with a /usr/share/dict file consisting of English words. First, create a pattern consisting of all of those words (ignoring anything shorter than three characters, sorting by length of word and alphabetically). Then, match each identifier progressively starting from the end of the identifier against this pattern, removing matched portions. If, at the end of this, there is anything remaining in the original string, flag that particular identifier for human attention.
In the refined version below, I allowed for the dictionary file, the list of identifiers, and the output to be specified as command line arguments. Only the dictionary file is mandatory. One can optionally specify another set of strings to be considered as "words" by specifying the command line option --mydict. For the OP's case, one has to specify a file containing the string desc and the misspelled conrol as well as any other abbreviations.
Here is the code (caution: I went a little iterator-crazy with this one!)
#!/usr/bin/perl
# Copyright (C) 2010 A. Sinan Unur
# License http://search.cpan.org/perldoc/perlartistic
# THIS SCRIPT IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS
# OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION,
# THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS
# FOR A PARTICULAR PURPOSE
use strict; use warnings;
use File::Slurp;
use Getopt::Long;
my ($opt_list, $opt_dict, $opt_mydict, $opt_out);
GetOptions(
"list=s" => \$opt_list,
"dict=s" => \$opt_dict,
"mydict=s" => \$opt_mydict,
"out=s" => \$opt_out,
) or die sprintf(
"%s %s [ %s ] [ %s ] [ %s ]\n",
$0,
"--dict='dictionary file'",
"--mydict='custom dictionary file'",
"--list='list file'",
"--out='output file'"
);
die "You must specify a dictionary file\n" unless defined $opt_dict;
my $matcher = make_matcher($opt_dict, $opt_mydict);
my $reader = make_reader($opt_list);
my $writer = make_writer($opt_out);
while ( defined( my $line = $reader->() ) ) {
my @identifiers = split ' ', $line;
for my $identifier ( @identifiers ) {
my $orig = $identifier;
my @stack;
while ( defined (my $word = $matcher->($identifier) ) ) {
push @stack, $word;
}
push @stack, $identifier, '*' if length $identifier;
$writer->(
sprintf "%s : %s\n", $orig, join(' ', reverse @stack)
);
}
}
sub make_reader {
my ($list_file) = @_;
return sub { <> } unless defined $list_file;
open my $list_h, '<', $list_file
or die "Cannot open '$list_file' for reading: $!";
return sub { <$list_h> };
}
sub make_writer {
my ($output_file) = @_;
return sub { print @_ } unless defined $output_file;
return sub { append_file $output_file, \ @_ };
}
sub make_matcher {
my ($dict_file, $my_dict_file) = @_;
my $pat = join '|',
map quotemeta,
sort { length $b <=> length $a || $a cmp $b }
grep 2 < length,
map { chomp; $_ }
read_file( $dict_file ),
defined( $my_dict_file ) ? read_file($my_dict_file) : ()
;
my $re = qr/(?<word>$pat)\z/;
return sub {
return $+{word} if $_[0] =~ s/$re//;
return;
}
}
Now, given a list of identifiers in list.txt:
payperiodmatchcode labordistributioncodedesc dependentrelationship
actionendoption actionendoptiondesc addresstype addresstypedesc
historytype psaddresstype rolename bankaccountstatus
bankaccountstatusdesc bankaccounttype bankaccounttypedesc
beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
beneficiaryclass beneficiaryclassdesc benefitactioncode
benefitactioncodedesc benefitagecontrol benefitagecontroldesc
ageconrolagelimit ageconrolnoticeperiod
and a custom dictionary in mydict.txt, we can invoke the script as:
$ expand_identifiers --dict=/usr/share/dict/words --mydict=mydict.txt \ --list=list.txt --output=expanded.txt
and get the output in the file expanded.txt:
payperiodmatchcode : pay period match code labordistributioncodedesc : labor distribution code desc dependentrelationship : dependent relationship actionendoption : action end option actionendoptiondesc : action end option desc addresstype : address type addresstypedesc : address type desc historytype : history type psaddresstype : * ps address type rolename : role name bankaccountstatus : bank account status bankaccountstatusdesc : bank account status desc bankaccounttype : bank account type bankaccounttypedesc : bank account type desc beneficiaryamount : beneficiary amount beneficiaryclass : beneficiary class beneficiarypercent : beneficiary percent benefitsubclass : benefit subclass beneficiaryclass : beneficiary class beneficiaryclassdesc : beneficiary class desc benefitactioncode : benefit action code benefitactioncodedesc : benefit action code desc benefitagecontrol : benefit age control benefitagecontroldesc : benefit age control desc ageconrolagelimit : age conrol age limit ageconrolnoticeperiod : age conrol notice period
Notice that the suspicious identifier was marked with an asterisk.
One would still need some kind of human check on the output, but presumably, that would require much less effort than manually splitting thousands of such identifiers.
As I mentioned over on Ovid's perl blog - check out
ReplyDeletehttp://www.perlmonks.org/?node_id=712392
Assuming you have a dictionary file containing all possible valid words, then it is absolutely possible to write a 100% accurate solution.
The solution I ended up going with used a trie implementation with a handful of options such as minimum token length as well as excluding solutions with invalid tokens (words not in the dictionary file).
@Joshua Thanks for that link.
ReplyDeleteJoshua is incorrect. I do not believe it is possible to write a 100% accurate solution. The problem is that, in practice, the data is not clean. And even if it was, you still have to determine how to split:
ReplyDeletebenefitaccountedit
Benefit Account Edit
Benefit Accounted It
Benefit Account Ed It
Which is correct?
The following link proposes a solution that attains 95% accuracy (or higher):
http://www.whitemagicsoftware.com/software/java/wordsplit/
@Anonymous
ReplyDeleteAll 3 are valid solutions which is why 100% accuracy is possible. If you read the Stackoverflow example, there were multiple possible outcomes just as in your example. Provided you produce ALL possible solutions then you are 100% accurate.