Cameron Laird's personal notes on extraction of content from PDF


My first advice to anyone considering a project that involves retrieval of content from PDF documents is immediate and clear: don't do that. If at all possible, choose a different approach. PDF instances originated somewhere else: they are office-automation documents rendered to PDF, or engineering reports saved as PDF, or database-backed analyses generated into PDF. Obtain the original data, and avoid the fragile, error-prone round trip through PDF.

For more on why to avoid content extraction from PDF, see, for instance, ProPublica's description of its experience.

Two major exceptions modify this advice: "search", very broadly considered; and retrieval of tabulated data.

Approximate content extraction

Search services of course include PDF in their corpora. They do this by careful construction of rather complicated workflows involving optical character recognition, pattern recognition, approximate matching, and several "big data" or "expert" specialties. The result is rarely "the exact text of this document is $TEXT"; more often it is "this document almost certainly is a recipe for blueberry muffins (or a wind-tunnel test run, or a biographical profile of Pāṇini, or whatever the case is)."
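To make "approximate matching" concrete, here is a minimal sketch that uses only Python's standard difflib module. It assumes some earlier step (OCR, for instance) has already produced a noisy string; the function name, the list of known phrases, and the 0.6 cutoff are all hypothetical choices for illustration. Real search services use far more elaborate pipelines than this.

```python
from difflib import get_close_matches
from typing import List, Optional

# Hypothetical catalogue of phrases we already know how to label.
KNOWN_PHRASES = [
    "blueberry muffins recipe",
    "wind-tunnel test run",
    "biographical profile",
]


def closest_known_phrase(
    noisy: str, known: List[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the known phrase most similar to `noisy`, or None.

    The cutoff (0.0 to 1.0) is an assumed threshold: candidates scoring
    below it are rejected rather than guessed at.
    """
    matches = get_close_matches(noisy.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # "b1ueberry muffns" imitates typical OCR damage: a digit for a
    # letter and a dropped character.
    print(closest_known_phrase("b1ueberry muffns recipe", KNOWN_PHRASES))
    print(closest_known_phrase("zzzz qqqq", KNOWN_PHRASES))
```

The answer is a best guess or nothing at all, never a claim about the exact text. That matches the "almost certainly" character of the results described above.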

The marketplace for content-extraction products and services is rich, dynamic, and frustrating. I'll eventually comment on a few of the offerings here.

Retrieval from PDF documents of tabulated data

A small minority of all PDF content-extraction projects actually aim to retrieve tabulated data from documents available only as PDF. Examples include: published government reports of license statuses; scientific summaries of experimental trials; summaries from non-profits of their accomplishments; organizational memoranda which emphasize accounting or engineering data; and so on. The regularity of the content and its format make it feasible to retrieve precise data from many such documents.
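As an illustration of how that regularity can be exploited, the sketch below parses rows of a license-status table with a regular expression. It assumes a separate tool has already extracted the page's text, one table row per line. The column layout, the license-number pattern, the status values, and the sample page are all invented for this example; a real report needs a pattern fitted to its own layout.

```python
import re
from typing import Dict, List

# Hypothetical row layout: license number, holder name, status, expiry date.
# Columns are assumed to be separated by runs of two or more spaces.
ROW = re.compile(
    r"^(?P<license>[A-Z]{2}-\d{4})\s+"
    r"(?P<holder>.+?)\s{2,}"
    r"(?P<status>ACTIVE|EXPIRED|SUSPENDED)\s+"
    r"(?P<expires>\d{4}-\d{2}-\d{2})$"
)

# Invented sample of text as it might come out of a PDF text extractor.
SAMPLE_PAGE = """\
State Board of Examples        Page 3
License   Holder               Status     Expires
TX-1001   Ada Lovelace         ACTIVE     2026-01-31
TX-1002   Charles Babbage      EXPIRED    2019-06-30
"""


def parse_rows(page_text: str) -> List[Dict[str, str]]:
    """Return one dict per line that matches the expected row layout.

    Lines that do not match (page headers, column titles, blank lines)
    are skipped silently.
    """
    rows = []
    for line in page_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_PAGE):
        print(row)
```

Skipping non-matching lines keeps the sketch short, but it also hides rows that the pattern fails to recognize. A production job would more likely log or count those lines so that a change in the report's layout is noticed.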

Our boutique software consultancy, Phaseit, Inc., is expert in this sort of development. We're happy to respond quickly with bids on projects of all scales, either fixed-price or hourly. When appropriate, we direct inquiries to competing services and products, including a few free-free tools.
