Friday, April 27, 2012

How do you pass the PostData argument to the Navigate method of IWebBrowser2 using Win32::OLE?

Some years ago, well before Selenium made its debut, I wrote a program that scraped some information from a U.S. government web site. It was one of those "Made for Internet Explorer 6" monstrosities with 1 to 3 MB ViewStates embedded in each page. Most of the crucial actions could only be initiated via JavaScript from within the browser, which made the usual Mechanize solution unworkable. Sure, I could have tried to capture and decipher the traffic using the Web Scraping Proxy, but my first few attempts were extremely disheartening.

The scraper was going to run on Windows workstations anyway, so I figured I could just use Win32::OLE to control Internet Explorer. The approach worked, and I wrote a monstrosity that collected data fast enough to help Cornell Extension staff guide seniors through the process of choosing among Medicare Part D plans.

A few days ago, a question on Stack Overflow titled "How can I read a https page's content using Perl on Windows, without installing OpenSSL?" caught my attention. Now, to be honest, if Crypt::SSLeay, IO::Socket::SSL, or the like is installed automatically with whatever distribution the OP is using, the whole issue is moot.

But I was too lazy to investigate that (e.g., I know Strawberry Perl includes Crypt::SSLeay, but I wasn't motivated enough to go through the whole "discovery" process with the OP).

No, I was suddenly nostalgic for the pain of the days of controlling 16 parallel Internet Explorer instances on each machine, and I realized I had never figured out how to do a POST using IWebBrowser2.

It turns out it is quite straightforward: You just have to include a PostData argument in the invocation. Except …

"The post data specified by PostData is passed as a SAFEARRAY Data Type structure. The VARIANT should be of type VT_ARRAY|VT_UI1 and point to a SAFEARRAY Data Type. The SAFEARRAY Data Type should be of element type VT_UI1, dimension one, and have an element count equal to the number of bytes of post data."

Well … OK then. How do I do that in Perl?

That is, given:

    my $html = $poster->post(
        'http://test.localdomain:8080/cgi-bin/showcgi.pl',
        [
            [ var1 => 'Yağmur', ],
            [ var2 => 'Øl']
        ],
    );

what do I do with that second argument so that, by the time it's passed on to IWebBrowser2's Navigate method, it is in the appropriate format?

Here are the steps:

  1. Make a single string:

    my $postdata = join '&', map join('=', @$_), @$data;

  2. Encode that string into octets:

    $postdata = encode('UTF-8', $postdata);

  3. Create a Variant to hold that:

    my $vPostData = Variant(VT_ARRAY|VT_UI1, length $postdata);

  4. Put $postdata in the Variant:

    $vPostData->Put($postdata);

  5. Invoke Navigate:

    $ie->Navigate(
        $url,
        $flags,
        '_self',
        $vPostData,
        "Content-Type: application/x-www-form-urlencoded\015\012",
    );

I deciphered that last part thanks to this Microsoft KB article: How To Use the PostData Parameter in WebBrowser Control.
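One thing steps 1 and 2 leave out is percent-encoding: the names and values are joined as-is, so a value containing `&` or `=` would corrupt the body. The following is a minimal sketch of how the body could be built with escaping, not what the script below does. The helper names `form_escape` and `build_form_body` are made up for illustration, and the sketch assumes the values are character strings (hence `use utf8;` for the literals). It encodes a space as `%20` rather than `+`, which servers generally accept in form bodies.

```perl
use strict;
use warnings;
use utf8;                 # assumption: this source file is saved as UTF-8
use Encode qw(encode);

# Hypothetical helper: encode a character string to UTF-8 octets, then
# percent-encode every octet that is not an RFC 3986 unreserved character.
sub form_escape {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

# Hypothetical helper: turn [ [ name => value ], ... ] into a form body.
sub build_form_body {
    my ($pairs) = @_;
    return join '&', map { join '=', map { form_escape($_) } @$_ } @$pairs;
}

my $body = build_form_body([
    [ var1 => 'Yağmur' ],
    [ var2 => 'Øl & more' ],
]);

print "$body\n";   # var1=Ya%C4%9Fmur&var2=%C3%98l%20%26%20more
```

The result is already a string of ASCII octets, so it could be handed to `Variant(VT_ARRAY|VT_UI1, length $body)` and `Put` without a further `encode` call.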

Here's a short script demonstrating the technique. In my case, the target, showcgi.pl, just creates a CGI::Simple and prints the output of $cgi->Dump.

#!/usr/bin/env perl

package My::Poster;

use strict; use warnings;
use Const::Fast;
use Encode;
use Try::Tiny;
use Win32::OLE;
use Win32::OLE::Variant;
local $Win32::OLE::Warn = 3;

# http://msdn.microsoft.com/en-us/library/aa768360%28v=vs.85%29.aspx
const my %BrowserNavConstants => (
    navOpenInNewWindow => 0x1,
    navNoHistory => 0x2,
    navNoReadFromCache => 0x4,
    navNoWriteToCache => 0x8,
    navAllowAutosearch => 0x10,
    navBrowserBar => 0x20,
    navHyperlink => 0x40,
    navEnforceRestricted => 0x80,
    navNewWindowsManaged => 0x0100,
    navUntrustedForDownload => 0x0200,
    navTrustedForActiveX => 0x0400,
    navOpenInNewTab => 0x0800,
    navOpenInBackgroundTab => 0x1000,
    navKeepWordWheelText => 0x2000,
    navVirtualTab => 0x4000,
    navBlockRedirectsXDomain => 0x8000,
    navOpenNewForegroundTab => 0x10000,
);

sub new {
    my $class = shift;
    my $self = bless {} => $class;
    $self->init;
    return $self;
}

sub ie {
    my $self = shift;
    my $ie = shift;

    return $self->{ie} unless defined $ie;

    $self->{ie} = $ie;
    return;
}

sub init {
    my $self = shift;

    $self->ie(
        Win32::OLE->new(
            'InternetExplorer.Application',
            sub {
                my $ie = shift;
                try { $ie->Quit if $ie } catch { warn "$_\n" };
            },
        )
    );
    return;
}

sub post {
    my $self = shift;
    my ($url, $data) = @_;

    my $ie = $self->ie;

    my $flags = $BrowserNavConstants{navNoHistory} |
                $BrowserNavConstants{navNoReadFromCache} |
                $BrowserNavConstants{navNoWriteToCache} |
                $BrowserNavConstants{navEnforceRestricted} |
                $BrowserNavConstants{navNewWindowsManaged} |
                $BrowserNavConstants{navUntrustedForDownload} |
                $BrowserNavConstants{navBlockRedirectsXDomain}
    ;

    my $postdata = join '&', map join('=', @$_), @$data;
    $postdata = encode('UTF-8', $postdata);

    my $vPostData = Variant(VT_ARRAY|VT_UI1, length $postdata);
    $vPostData->Put($postdata);

    # http://msdn.microsoft.com/en-us/library/aa752133%28v=vs.85%29.aspx
    $ie->Navigate(
        $url,
        $flags,
        '_self',
        $vPostData,
        "Content-Type: application/x-www-form-urlencoded\015\012",
    );

    # Poll once a second until the browser reports READYSTATE_COMPLETE (4).
    sleep 1 until $ie->{ReadyState} == 4;
    return $ie->Document->documentElement->innerHTML;
}

sub DESTROY {
    my $self = shift;
    try { $self->ie->Quit } catch { warn "$_\n" };
    return;
}

package main;

use strict; use warnings;

my $poster = My::Poster->new;

my $html = $poster->post(
    'http://test.localdomain:8080/cgi-bin/showcgi.pl',
    [
        [ var1 => 'Yağmur', ],
        [ var2 => 'Øl']
    ],
);

print $html if defined $html;
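The `sleep 1 until $ie->{ReadyState} == 4;` line in `post` waits forever if the page never finishes loading. Here is a minimal sketch of a bounded wait. The helper `wait_until` is hypothetical and not part of the script above, and the 30-second limit in the comment is an arbitrary assumption. A stand-in condition replaces the browser check so the sketch runs without Internet Explorer.

```perl
use strict;
use warnings;

# Hypothetical helper: call $is_ready repeatedly until it returns true
# or $timeout seconds have passed. Returns 1 on success, 0 on timeout.
sub wait_until {
    my ($is_ready, $timeout, $interval) = @_;
    $interval //= 1;                      # seconds between polls
    my $deadline = time + $timeout;
    until ($is_ready->()) {
        return 0 if time >= $deadline;
        sleep $interval;
    }
    return 1;
}

# In post(), the call might look like this:
#   wait_until(sub { $ie->{ReadyState} == 4 }, 30)
#       or die "Timed out waiting for $url\n";

# Stand-in condition: becomes true on the third poll.
my $polls = 0;
my $ok = wait_until(sub { ++$polls >= 3 }, 5, 0);

print $ok ? "ready after $polls polls\n" : "timed out\n";
```

Returning a status instead of dying inside the helper leaves it to the caller to decide whether a timeout is fatal.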

PS: I have no idea how well this works with UAC in Windows versions after XP SP3, so take it with a grain of salt.

In my case, with local $CGI::Simple::PARAM_UTF8 = 1; in the CGI script, I got the expected output:

$VAR1 = bless( {
                 '.parameters' => [
                                    'var1',
                                    'var2'
                                  ],
                 '.globals' => {
                                 'DEBUG' => 0,
                                 'NO_UNDEF_PARAMS' => 0,
                                 'NO_NULL' => 1,
                                 'FATAL' => -1,
                                 'USE_PARAM_SEMICOLONS' => 0,
                                 'PARAM_UTF8' => 1,
                                 'DISABLE_UPLOADS' => 1,
                                 'USE_CGI_PM_DEFAULTS' => 0,
                                 'NPH' => 0,
                                 'POST_MAX' => 102400,
                                 'HEADERS_ONCE' => 0
                               },
                 'var1' => [
                             "Ya\x{c4}\x{9f}mur"
                           ],
                 '.fieldnames' => {
                                    'var1' => 1,
                                    'var2' => 1
                                  },
                 'var2' => [
                             "\x{c3}\x{98}l"
                           ],
                 '.crlf' => '
',
                 '.header_printed' => 1
               }, 'CGI::Simple' );
