An HTML parser will make your life easier, even if you are using regexes to extract information from a web page

Some Background

An ongoing pet peeve of mine is that the U.S. Department of Labor, Employment & Training Administration puts out press releases like this one, in which they say things like:

In the week ending September 15, the advance figure for seasonally adjusted initial claims was 382,000, a decrease of 3,000 from the previous week’s revised figure of 385,000.

Well, the previous week’s advance number was indeed 382,000, the same as this particular week’s advance number. But they compare this week’s advance number to the revised number for the previous week, and say that initial unemployment claims fell. I find this annoying because it is not an apples-to-apples comparison. The information we have this week says that this week’s unrevised number is the same as the previous week’s unrevised number, and we do not yet know how the number of initial unemployment claims will change after this week’s revisions.

Just so we are clear: I am not saying there is anything fishy about such revisions. I just think comparisons ought to be made between comparable numbers.

This motivated me to put together a table containing three columns: the press release date, the advance number for that week, and the revised number for that week taken from the following week’s report, so I could compare the two and look at the trend in each individually (obviously, the last available row will have a missing value in the third column).

How HTML::TokeParser::Simple helped

Based on my experience in and with the government and education sectors, I’ll guess that the press releases are prepared using the following procedure:

  1. A trusted admin assistant is given the numbers.

  2. The admin assistant copies the previous week’s press release to a new file and gives it a name corresponding to the date of the press release.

  3. The admin assistant then manually changes the values in the press release.

  4. Someone does something that makes the document visible to the rest of the world at the appropriate time.

The editing step is what frustrates me here. Any one of us can write a script that fills a template using the output of a program in no time. Clearly, the U.S. Federal Government can have something like that written. So, why do I think the documents are being edited manually? Because, during the process of extraction, I encountered some really quirky typos: a missing space or an extra period here and there, and inconsistent naming of months (is it ‘September’, ‘Sept.’, or ‘Sep.’?).
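For what it’s worth, filling such a template takes only a few lines of Perl. This is a sketch, not what the Department actually uses; the field names and figures are made up for illustration:

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hypothetical figures, as they might come out of the statistical system
my %fig = (
    week_ending => 'September 15',
    advance     => '382,000',
    change      => 'a decrease of 3,000',
    prev_rev    => '385,000',
);

# The boilerplate sentence with placeholders for the numbers
my $template = "In the week ending %s, the advance figure for "
             . "seasonally adjusted initial claims was %s, %s from "
             . "the previous week's revised figure of %s.";

printf "$template\n", @fig{qw(week_ending advance change prev_rev)};
```

A script like this would never misspell ‘September’ or drop a space, which is exactly why the typos suggest manual editing.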

And, I was only looking at a single line in a single paragraph.

Press releases going back to 2002 can be found at http://ows.doleta.gov/press/. I downloaded the releases for 2006 to the present using wget -np -r -w 3 --random-wait http://ows.doleta.gov/press/YYYY/ (with YYYY replaced by each year).

The HTML source for that paragraph I am interested in looks like this:

<P><B>UNEMPLOYMENT INSURANCE WEEKLY CLAIMS REPORT</B></P>

<P>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<B><U>SEASONALLY ADJUSTED DATA</U></B><BR><BR>

In the week ending September 15, the advance figure
for seasonally adjusted <B>initial claims</B>
was 382,000, a decrease of 3,000 from the previous
week's revised figure of 385,000. The 4-week moving
average was 377,750, an increase of 2,000 from the
previous week's revised average of 375,750.</P>

I only need two bits of information from that paragraph. Clearly, a simple pattern along the lines of /was ([0-9]+,[0-9]+).+figure of ([0-9]+,[0-9]+)/ should work, but I wanted to make sure I was extracting the pieces from the correct location.
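To make sure the right number ends up in the right column, named captures help. Here is a small sketch of that idea, run against the sentence quoted above:

```perl
#!/usr/bin/perl
use strict; use warnings;

# The sample sentence from the press release quoted above
my $text = "In the week ending September 15, the advance figure "
         . "for seasonally adjusted initial claims was 382,000, "
         . "a decrease of 3,000 from the previous week's revised "
         . "figure of 385,000.";

# Named captures document which number is which
my $pat = qr{
    initial [ ] claims [ ] was [ ] (?<advance>  [0-9]+,[0-9]+) .+
    revised [ ] figure [ ] of  [ ] (?<prev_rev> [0-9]+,[0-9]+)
}x;

if ($text =~ $pat) {
    print "advance:  $+{advance}\n";   # 382,000
    print "prev_rev: $+{prev_rev}\n";  # 385,000
}
```

The surrounding words ‘initial claims was’ and ‘revised figure of’ anchor each capture, so a stray number elsewhere in the paragraph cannot be picked up by mistake.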

The solution is to look for <u> tags, and, when you find one with the content 'SEASONALLY ADJUSTED DATA', grab the text of the following paragraph, normalize the spaces, and extract the two bits we are looking for.

I chose not to try to figure out the full date of the period the press release covered. If you do care, you can use DateTime to get the date of the first Tuesday before the press release so that you don’t run into problems with releases in the first week of the year.
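If you did want that date, a helper along these lines would do. This is a sketch (the function name is mine, not from any module), assuming DateTime’s convention that day_of_week runs from 1 (Monday) to 7 (Sunday):

```perl
#!/usr/bin/perl
use strict; use warnings;
use DateTime;

# Hypothetical helper: the last Tuesday strictly before a given date.
# Tuesday is day_of_week 2 in DateTime.
sub last_tuesday_before {
    my ($dt) = @_;
    my $days_back = ($dt->day_of_week - 2) % 7;
    $days_back = 7 if $days_back == 0;   # strictly before, never the same day
    return $dt->clone->subtract(days => $days_back);
}

my $release = DateTime->new(year => 2012, month => 9, day => 20);
print last_tuesday_before($release)->ymd, "\n";   # 2012-09-18
```

Because the helper always steps backwards from the release date itself, a release in the first week of January lands on a Tuesday in the previous year with no special casing.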

I chose to use HTML::TokeParser::Simple over modules which build trees because the structure of the HTML is really funky, and I really did not need a whole tree structure.

Here is the script I used to process the files once they were downloaded:

#!/usr/bin/perl

use strict; use warnings;

use File::Find;
use File::Spec::Functions qw( canonpath );
use HTML::TokeParser::Simple;

my $TOP = 'E:\Home\asu1\src\doleta\ows.doleta.gov\press';

my @files;
find(sub {
        return unless -f;
        return unless /\A ([0-9]{6}) [.] asp \z/x;
        my ($m, $d, $y) = (/([0-9]{2})/g);
        push @files, [
            sprintf('%04d/%02d/%02d', 2000 + $y, $m, $d),
            canonpath($File::Find::name)
        ];
        return;
    },
$TOP);

@files = sort { $a->[0] cmp $b->[0] } @files;

my @data;

my $pat = qr{
    (?: initial [ ] claims [ ] was [ ] (?<advance> [0-9]+,[0-9]+) ) .+
    (?: revised [ ] figure [ ] of [ ] (?<prev_rev> [0-9]+,[0-9]+) )
}x;

for my $file ( @files ) {
    my ($date, $name) = @$file;

    my $parser = HTML::TokeParser::Simple->new(file => $name);
    $parser->unbroken_text(1);

    while (my $token = $parser->get_tag('u')) {
        my $content = $parser->get_text('/u');
        $content =~ s/\s+/ /g;
        last if $content eq 'SEASONALLY ADJUSTED DATA';
    }

    my $text = $parser->get_text('p');
    $text =~ s/\s+/ /g;

    my %obs;
    $text =~ $pat
        or die "Failed to parse '$name':\n\n[[[$text]]]\n";
    %obs = %+;

    s/[^0-9]+//g for values %obs;

    $data[-1]->[2] = $obs{prev_rev} if @data;
    push @data, [$date, $obs{advance}];
}

print join("\t", @$_), "\n" for @data;

It wouldn’t be much work (say, two weeks) to process all previous press releases, extract all the data into appropriate tables, and make it available in a consistent, machine-readable format separate from whatever visual styling is used.

Oh, and, my gut feeling is that the number of initial unemployment claims for the week ending September 15th will be revised up to 389,000.

We can check if my hunch is correct on September 27th.