Scraping PDF documents without losing your sanity

The epiphany came when I was trying to extract usable information from a bunch of documents.

Some people insist on distributing essential information in PDF format, which makes it very hard to put that information to use.

Now, I have never really made it past the table of contents of Adobe’s PDF Reference, and I can’t really figure out many of the available Perl modules dealing with PDFs. I know what goes into a PDF document (basically boxes with coordinates), but, just as I have never written a web server in PostScript, I haven’t gone into this in any depth.

One of the problems with utilities that naively convert PDF to text is that they usually do a straightforward translation of the layout, which does a number on the order in which the text comes out. The location of an object on the page and its position in the object stream don’t correspond very reliably to each other.

Thanks to Thomas Levine, I found out about pdftohtml.

At first, I was very frustrated … Then, I realized the value of the -xml option.

With this option, the PDF document is output as <page> and <text> elements. For example:

<page number="6" position="absolute"
  top="0" left="0" height="918" width="1188">
…
<text top="176" left="109" width="125"
  height="15" font="2">DATA RECORD </text>

This is extremely useful when trying to extract information. For one thing, if whoever produced the document used consistent styling, the font attribute of the text elements can be used to select items of interest. However, multi-column documents are still a pain.
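To make the font trick concrete, here is a minimal Python sketch of selecting text elements by their font attribute. The sample string is made up for illustration: it reuses the DATA RECORD element from above, adds a second invented element, and wraps everything in a root element (I am assuming the real output has a single root, so the parser is happy either way). The function name texts_with_font is my own.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the -xml output shown above.
# The root element and the second <text> element are assumptions for illustration.
SAMPLE = """<pdf2xml>
<page number="6" position="absolute" top="0" left="0" height="918" width="1188">
<text top="176" left="109" width="125" height="15" font="2">DATA RECORD </text>
<text top="200" left="109" width="80" height="12" font="5">some value</text>
</page>
</pdf2xml>"""


def texts_with_font(xml_string, font):
    """Return (page number, text) pairs for text elements using the given font id."""
    root = ET.fromstring(xml_string)
    hits = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            if text.get("font") == font:
                # itertext() also picks up text nested in child tags such as <b>
                content = "".join(text.itertext()).strip()
                hits.append((int(page.get("number")), content))
    return hits


print(texts_with_font(SAMPLE, "2"))  # [(6, 'DATA RECORD')]
```

Which font id marks the items of interest depends entirely on the document, so in practice you would eyeball the XML first and pick the id from there.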

The key to my epiphany lies in sorting the text elements using a lexicographic ordering: Text on page 5 should come before text on page 7. Text in column one comes before text in column three. Text on line five in column two comes before text on line two in column three … See what I did there?
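In Python, tuples already compare lexicographically, so the ordering described above falls out of a plain sort once each piece of text carries a (page, column, line) key. The data below is invented purely to mirror the examples in the previous paragraph, and reading_order_labels is a name I made up.

```python
def reading_order_labels(items):
    """Sort (page, column, line, label) tuples and return the labels in order.

    Tuples compare element by element: page first, then column, then line.
    """
    return [label for _page, _column, _line, label in sorted(items)]


# Invented data: (page, column, line, label)
items = [
    (7, 1, 1, "page 7, column 1, line 1"),
    (5, 3, 2, "page 5, column 3, line 2"),
    (5, 2, 5, "page 5, column 2, line 5"),
    (5, 1, 9, "page 5, column 1, line 9"),
]

for label in reading_order_labels(items):
    print(label)
```

Everything on page 5 comes out before page 7, and line five of column two comes out before line two of column three, exactly as in the prose.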

At first you might think it is OK to define columns using the left attributes of text elements directly. The problem arises when some of the fields you want to extract are given in section headers that can appear in the middle of a column. People will usually center the text in those headers (for aesthetic reasons), so a header's left value is larger than that of the data items below it, and it will appear to be in a later column than the items that follow.

This may seem obvious right now, but the solution came to me only after looking at the following plot:

That is, I need a mapping of ranges of left margins to columns.
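One way to express such a mapping, sketched in Python with a binary search over the left edges where each column begins. The boundary values here are hypothetical; the real ones have to be read off the distribution of left values in the document at hand. The names COLUMN_STARTS and column_of are mine.

```python
from bisect import bisect_right

# Hypothetical left edges at which columns one, two and three begin.
COLUMN_STARTS = [0, 400, 800]


def column_of(left):
    """Map a left coordinate to a 1-based column number.

    Any left value from one column start up to (but not including) the next
    belongs to that column, so a centered header stays in its own column.
    """
    return bisect_right(COLUMN_STARTS, left)


print(column_of(109))  # 1: ordinary text at the left margin of column one
print(column_of(250))  # 1: a centered header, still column one
print(column_of(400))  # 2
print(column_of(815))  # 3
```

The point of using ranges rather than exact left values is that the centered header at 250 lands in the same column as the data item at 109.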

Once that mapping is defined, text elements can be sorted into a natural reading order, and information can be extracted using the usual methods.
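Putting the pieces together, here is a rough Python sketch of the whole idea: read the page and text elements, assign each text element a column from its left value, and sort by (page, column, top, left). The sample XML, its root element, and the column boundaries are all invented for illustration, and I am assuming the coordinates are plain integers as in the example above.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right

# Hypothetical left edges at which each column begins.
COLUMN_STARTS = [0, 400, 800]

# Invented sample, deliberately out of reading order.
SAMPLE = """<pdf2xml>
<page number="6" position="absolute" top="0" left="0" height="918" width="1188">
<text top="300" left="415" width="90" height="15" font="1">second column, line 1</text>
<text top="340" left="109" width="90" height="15" font="1">first column, line 2</text>
<text top="300" left="220" width="120" height="15" font="2">CENTERED HEADER</text>
</page>
<page number="5" position="absolute" top="0" left="0" height="918" width="1188">
<text top="500" left="815" width="90" height="15" font="1">page 5, third column</text>
</page>
</pdf2xml>"""


def reading_order(xml_string):
    """Return the text of all elements sorted by page, column, top, then left."""
    root = ET.fromstring(xml_string)
    rows = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for text in page.iter("text"):
            left = int(text.get("left"))
            top = int(text.get("top"))
            column = bisect_right(COLUMN_STARTS, left)  # 1-based column number
            content = "".join(text.itertext()).strip()
            rows.append((page_no, column, top, left, content))
    # Sort on the positional key only, never on the text itself.
    rows.sort(key=lambda row: row[:4])
    return [row[4] for row in rows]


for line in reading_order(SAMPLE):
    print(line)
```

With this sample, the page 5 text comes first, then the centered header, then the rest of column one, then column two. Sorting on top within a column assumes that text on the same line shares the same top value, which may need some rounding in real documents.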