Friday, September 10, 2010

Extract bullet lists from PowerPoint slides using Perl and Win32::OLE

This is based on my answer to another question on Stack Overflow. That answer works, but it is cumbersome because I ignored Win32::OLE::Enum completely and littered the code with unnecessary loop variables.

Well, such is life when you are trying to decipher the structure of a PowerPoint slide.

Slides contain shapes. Shapes can be of various types. For our purposes, we are interested in shapes for which the HasTextFrame property is true. A TextFrame has a TextRange. You can access the TextRange by characters, lines, or paragraphs. For extracting bullet lists, accessing it by paragraphs is the right thing to do, because each bullet item is a paragraph.

Once we have a paragraph, we ship it off to the print_par routine. The code in this example prints non-bullet text as well, but treats bullet text specially. We first get the bullet format from the Bullet property of the ParagraphFormat associated with the paragraph at hand, then check its Type. In addition to the ppBulletNumbered and ppBulletUnnumbered types, there are also the ppBulletNone, ppBulletMixed, and ppBulletPicture bullet types. Finally, there are a bazillion numbered bullet styles (see PpNumberedBulletStyle) which you should take into account if you care about such things.

Here is the script:

#!/usr/bin/perl

use strict; use warnings;
use Try::Tiny;
use Win32::OLE;
use Win32::OLE::Const qw( Microsoft.PowerPoint );
use Win32::OLE::Enum;

$Win32::OLE::Warn = 3;    # croak on any OLE error

my $ppt = get_ppt();
binmode STDOUT, ':utf8';

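# The second argument opens the file read-only. PowerPoint resolves a
# relative path against its own working directory, so prefer a full path.
my $presentation = $ppt->Presentations->Open('test.ppt', 1);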
my $slides = Win32::OLE::Enum->new( $presentation->Slides );

SLIDE:
while ( my $slide = $slides->Next ) {
    my $name = $slide->Name;
    printf "=== Begin slide: %s ===\n", $name;

    my $shapes = Win32::OLE::Enum->new( $slide->Shapes );
    SHAPE:
    while ( my $shape = $shapes->Next ) {
        next SHAPE unless $shape->HasTextFrame;
        my $pars = Win32::OLE::Enum->new(
            $shape->TextFrame->TextRange->Paragraphs
        );
        PARAGRAPH:
        while ( my $par = $pars->Next ) {
            print_par( $par );
        }
    }
    printf "=== End slide: %s ===\n\n", $name;
}

$presentation->Close;

sub print_par {
    my $par = shift;

    my $indent = $par->IndentLevel;
    my $bformat = $par->ParagraphFormat->Bullet;
    my $btype = $bformat->Type;
    my $bchar;

    # see also PpNumberedBulletStyle
    $bchar = $btype == ppBulletNumbered   ? $bformat->Number
           : $btype == ppBulletUnnumbered ? chr $bformat->Character
           : $btype == ppBulletMixed      ? '[X]'
           : $btype == ppBulletPicture    ? '[IMG]'
           : '';

    my $text = $par->Text;
    $text =~ s/\s+$//;

    print(
        "\t" x ($indent - 1),
        $bchar ? ($bchar, ' ') : '',
        $text,
        "\n",
    );
}

sub get_ppt {
    my $ppt;

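    # Reuse a running PowerPoint instance if there is one. GetActiveObject
    # returns undef when PowerPoint is installed but not currently running.
    try { $ppt = Win32::OLE->GetActiveObject('PowerPoint.Application') }
    catch { die $_ };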

    unless ( $ppt ) {
        $ppt = Win32::OLE->new(
            'PowerPoint.Application', sub { $_[0]->Quit }
        ) or die sprintf(
            'Cannot start PowerPoint: %s', Win32::OLE->LastError
        );
    }

    return $ppt;
}

And here is sample output from running it on a very simple PowerPoint presentation:

=== Begin slide: Slide1 ===
This is a test presentation
subtitle
=== End slide: Slide1 ===

=== Begin slide: Slide2 ===
A bullet list
This is not a bullet
• Ya da
    – Da da
    – Ga ga
• Du da
    1 Nu da
        [IMG] Do da
=== End slide: Slide2 ===

=== Begin slide: Slide3 ===
A numbered list
1 One
    1 One a
    2 One b
2 Two
    1 Two I
    2 Two II
=== End slide: Slide3 ===
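
The script prints the bare Number for every numbered bullet, which is why the output above shows "1 One a" and "1 Two I" no matter how the list is numbered on the slide. If you do care about the numbered bullet styles, here is a minimal sketch of one way to turn a number and a style into a label. It is plain Perl with no OLE calls so it can be run anywhere. The helper names (to_alpha, to_roman, bullet_label) are mine, and keying the table by style name as a string is an assumption made to keep the sketch self-contained. In the real script you would pass $bformat->Style and $bformat->Number, and key the table by the constants imported through Win32::OLE::Const. Only a handful of the PpNumberedBulletStyle members are covered.

```perl
use strict;
use warnings;

# 1 -> a, 26 -> z, 27 -> aa (spreadsheet-style carry; what PowerPoint
# itself shows past 26 items is not checked here).
sub to_alpha {
    my $n = shift;
    my $s = '';
    while ( $n > 0 ) {
        $n--;
        $s = chr( ord('a') + $n % 26 ) . $s;
        $n = int( $n / 26 );
    }
    return $s;
}

# Standard subtractive Roman numerals, upper case.
sub to_roman {
    my $n = shift;
    my @glyphs = (
        [ 1000, 'M' ], [ 900, 'CM' ], [ 500, 'D' ], [ 400, 'CD' ],
        [ 100,  'C' ], [ 90,  'XC' ], [ 50,  'L' ], [ 40,  'XL' ],
        [ 10,   'X' ], [ 9,   'IX' ], [ 5,   'V' ], [ 4,   'IV' ],
        [ 1,    'I' ],
    );
    my $s = '';
    for my $pair (@glyphs) {
        my ( $value, $glyph ) = @$pair;
        while ( $n >= $value ) {
            $s .= $glyph;
            $n -= $value;
        }
    }
    return $s;
}

# Style name (a string here, by assumption) => label renderer.
my %LABEL_FOR = (
    ppBulletArabicPeriod     => sub { "$_[0]." },
    ppBulletArabicParenRight => sub { "$_[0])" },
    ppBulletAlphaLCPeriod    => sub { to_alpha( $_[0] ) . '.' },
    ppBulletAlphaUCPeriod    => sub { uc( to_alpha( $_[0] ) ) . '.' },
    ppBulletRomanLCPeriod    => sub { lc( to_roman( $_[0] ) ) . '.' },
    ppBulletRomanUCPeriod    => sub { to_roman( $_[0] ) . '.' },
);

# Fall back to the bare number for any style not in the table.
sub bullet_label {
    my ( $style, $number ) = @_;
    my $render = $LABEL_FOR{$style} or return $number;
    return $render->($number);
}

print bullet_label( 'ppBulletAlphaLCPeriod',    2 ), "\n";    # b.
print bullet_label( 'ppBulletRomanUCPeriod',    2 ), "\n";    # II.
print bullet_label( 'ppBulletArabicParenRight', 3 ), "\n";    # 3)
print bullet_label( 'some_other_style',         7 ), "\n";    # 7
```

Falling back to the bare number keeps the behavior of the original script for every style the table does not know about, so you can add styles one at a time as you run into them.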

This page has been translated into Spanish by Maria Ramos from Webhostinghub.com/support/edu.
