Monday, April 16, 2012

HTML::TableExtract is beautiful

And, it will help you save time and make money ;-)

I was motivated to post this by another one of those Stack Overflow questions. I decided at the outset not to answer it because the poster basically wants a job done for him for free:

I need the script to get the HTML, parse the table then to save the content (User + Online time), I would also want it to run every 15 mins and to make a report in the end of the day.

However, a so-called answer stated:

in my opinion perl can get a little ugly.

does it need to be perl....if it does ot i would recommend python.

Of course, I am kinda used to people proclaiming Perl sucks, but the supreme irony of the ugliness of the post asserting Perl's ugliness motivated me.

HTML::TableExtract is beautiful. Over the years, it has saved me a lot of time, and even helped me make some money.

So, consider the Personal Income table available from the Bureau of Economic Analysis.

Let's say I want to get the Unemployment Insurance row out of that table. Here's how you do it using HTML::TableExtract:

#!/usr/bin/env perl

use strict; use warnings;
use HTML::TableExtract;

my $te = HTML::TableExtract->new(
    attribs => { id => 'tbl' },
);

# local copy of
# http://bea.gov/iTable/iTableHtml.cfm?reqid=9&step=3&isuri=1&903=58

$te->parse_file('personal-income.html');

my ($table) = $te->tables;

for my $row ($table->rows) {
    my (undef, $label, @row) = @$row;   # skip the first column (line number)
    next unless defined $label;
    if ($label eq 'Unemployment insurance') {
        print "$label\t@row\n";
    }
}

And, here is the output:

C:\temp> uu
Unemployment insurance 101.1 127.9 144.8 148.7 152.8 137.4 135.8 128.7 117.5 108.8 103.0 100.1

Of course, things can be refined, but this is pretty beautiful.
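
One way to refine it, as a rough sketch rather than the definitive version: wrap the lookup in a small function that takes the HTML, the table id and the row label, and trims whitespace around the label before comparing. The function name extract_row and the tiny inline table are my own inventions for illustration; the table only mimics the assumed shape of the BEA page (line number, label, then one column per period) and its figures are made up.

```perl
#!/usr/bin/env perl

use strict; use warnings;
use HTML::TableExtract;

# Hypothetical helper: return an array ref holding the values of the row
# whose second cell matches $wanted, or nothing if there is no such row.
sub extract_row {
    my ($html, $table_id, $wanted) = @_;

    my $te = HTML::TableExtract->new(attribs => { id => $table_id });
    $te->parse($html);
    $te->eof;

    my ($table) = $te->tables;
    return unless $table;

    for my $row ($table->rows) {
        # Assumed layout: first cell is a line number, second is the label
        my (undef, $label, @values) = @$row;
        next unless defined $label;
        $label =~ s/^\s+|\s+$//g;    # tolerate padding around the label
        return \@values if $label eq $wanted;
    }
    return;
}

# Made-up stand-in for the real page, NOT actual BEA data
my $html = <<'HTML';
<table id="tbl">
<tr><th>Line</th><th>Item</th><th>Q1</th><th>Q2</th></tr>
<tr><td>1</td><td>Wages</td><td>10.0</td><td>11.0</td></tr>
<tr><td>2</td><td>  Unemployment insurance </td><td>1.5</td><td>1.2</td></tr>
</table>
HTML

my $values = extract_row($html, 'tbl', 'Unemployment insurance');
print "Unemployment insurance\t@$values\n" if $values;
```

Returning nothing when the table or the row is missing lets the caller decide whether that is an error, which matters if the page layout ever changes.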

2 comments:

  1. Yes, very useful. I use it to parse Jira pages. My variant:

    use strict;
    use warnings;
    use Modern::Perl;
    use HTML::TableExtract;

    sub parse_chunk_table {
        my $html_string = shift;

        my $te =
          HTML::TableExtract->new( attribs => { class => 'confluenceTable' } );
        $te->parse($html_string);

        # Examine all matching tables
        my %chunk = ();
        foreach my $ts ( $te->tables ) {
            my $check_tables = 0;

            # Count first-column values in the rows that follow
            # the 'chunk_revision_sk' header row
            foreach my $row ( $ts->rows ) {
                my $key = $row->[0] // '';    # empty cells come back undef
                if ( $key eq 'chunk_revision_sk' ) {
                    $check_tables = 1;
                }
                elsif ($check_tables) {
                    $chunk{$key}++;
                }
            }
        }

        return \%chunk;
    }

  2. It does not print the output you stated.
    It gives an error .. :X
    Please revise it.
