Cameron Laird's personal notes on PDF conversion utilities

Multitudes of FAQs and similar references for PDF information have been published in the past. As of 2003, I've found none that I regard as convenient and well-maintained on the subject of the "filters" that transform files to and from PDF, not even the Conversion tools page of PDFZone or PlanetPDF's Extraction page--so I'll start my own.

The focus of this page (does anyone think I should re-do it as a Wiki?) is on the products available to convert to and from PDF images. IDR Solutions explains the challenge.

Ghostscript/Ghostview answers many questions, at least partially.

David Boddie's pdftools and David Leonard's PDFFile provide interesting Python-coded raw materials for those unafraid of dirtying their hands with programming. Early in 2005, one appreciated correspondent wrote me that the latter "handles things like decryption better."

Concatenators

My clients often need to build reports which simply sequence existing (or generated) .pdf and/or .ps source. Here are a few of the alternatives:

A workaround for extreme situations is to "mollify" the PDF by roundtripping it through PS.
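As one illustration of sequencing files and of the PS roundtrip, here is a minimal Python sketch that drives Ghostscript. It assumes the gs, pdf2ps, and ps2pdf executables from a Ghostscript installation are on the PATH. The function names and file names are hypothetical, chosen only for this example. The command builders are kept separate from the code that launches them, so the command lines can be inspected without running anything.

```python
import subprocess


def gs_concat_command(inputs, output):
    """Build a Ghostscript command that writes all inputs, in order, to one PDF."""
    if not inputs:
        raise ValueError("need at least one input file")
    return [
        "gs",
        "-q", "-dBATCH", "-dNOPAUSE",  # quiet, non-interactive, exit when done
        "-sDEVICE=pdfwrite",           # ask Ghostscript to emit PDF
        "-sOutputFile=" + output,
        *inputs,                       # input files, processed in the order given
    ]


def roundtrip_commands(pdf_in, ps_tmp, pdf_out):
    """Build the two commands for the PDF -> PS -> PDF 'mollify' workaround."""
    return [["pdf2ps", pdf_in, ps_tmp], ["ps2pdf", ps_tmp, pdf_out]]


def run(cmd):
    """Run one command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)
```

A report could then be built with run(gs_concat_command(["cover.pdf", "body.ps"], "report.pdf")), and a troublesome PDF roundtripped by calling run on each command returned by roundtrip_commands.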

Products that extract text from PDF

Don't do it.

At least, that's my usual first response, although, as 2004 begins, a couple of products are making me soften that stance. I understand all the situations that make text-extraction appear to be desirable; I've lived through most of them myself. As several sages have counseled, however, from a programmatic standpoint, "think of PDF as paper", by which they mean you could use scissors and glue on it, but there's almost certainly a better way. Almost always, you're--we're--better off going upstream to the data where the PDFs originated.

If you insist on extracting text from PDF, get help, probably from the following list. This list remains partial; you're welcome to write me to ask that I unpack more of my notes.
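For those who do insist, one commonly available route is the pdftotext utility that ships with xpdf. The sketch below only shows how such a tool might be scripted from Python. It assumes pdftotext is installed and on the PATH; the function names are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build a pdftotext command line; -layout asks it to preserve page layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf_path, txt_path]


def extract_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext and raise CalledProcessError if it fails."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
```

The "think of PDF as paper" caveat still applies: the quality of the extracted text depends heavily on how the PDF was produced.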

Products that render PDF as DOC or RTF

Products that transform HTML to PS or PDF

For years, I've used Jan Kärrman's Perl-coded html2ps utility in a chain with Ghostscript (for ps2pdf) in production situations. However, as of March 2004, I've begun to rely on GPLed HTMLDOC, advertised as "a program that generates indexed HTML, PostScript, and PDF files from HTML 'source' ..." HTMLDOC has produced usable output from everything I've given it. Ghostscript, on the other hand, fails on certain output of html2ps. While I'm good enough with PostScript, Perl, and C to tackle the errors I've found so far, I currently find it more productive to rely on HTMLDOC. If someone from the html2ps or Ghostscript (or PStill, for that matter) projects wants reproducible symptoms and/or patches, I'll happily oblige.
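The two routes described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not my production scripts: it assumes html2ps, ps2pdf, and htmldoc are on the PATH, that html2ps writes PostScript to standard output when no output file is named, and that ps2pdf accepts "-" to read from standard input. The function names are hypothetical.

```python
import subprocess


def html2ps_chain(html_path, pdf_path):
    """Return the producer and consumer commands of the html2ps | ps2pdf pipeline."""
    # html2ps writes PostScript to stdout; "ps2pdf -" reads it from stdin.
    return ["html2ps", html_path], ["ps2pdf", "-", pdf_path]


def htmldoc_command(html_path, pdf_path):
    """Return an HTMLDOC command that renders one page of HTML to the named file."""
    return ["htmldoc", "--webpage", "-f", pdf_path, html_path]


def run_chain(html_path, pdf_path):
    """Run html2ps piped into ps2pdf; raise RuntimeError if either stage fails."""
    producer_cmd, consumer_cmd = html2ps_chain(html_path, pdf_path)
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.run(consumer_cmd, stdin=producer.stdout)
    producer.stdout.close()
    if producer.wait() != 0 or consumer.returncode != 0:
        raise RuntimeError("html2ps | ps2pdf failed for " + html_path)


def run_htmldoc(html_path, pdf_path):
    """Run HTMLDOC and raise CalledProcessError if it fails."""
    subprocess.run(htmldoc_command(html_path, pdf_path), check=True)
```

Checking the exit status of both stages matters in the chained version, since a Ghostscript failure on html2ps output is exactly the kind of error that otherwise goes unnoticed in a batch job.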

Yes, HTMLDOC is both commercial and free.

In 2007, I began to use iText also for PDF transformations.

The other HTML->PS and HTML->PDF products I know of apparently don't automate well and/or are available only for Windows.

While I still have no experience with Win*-based activePDF WebGrabber, its function apparently is exactly to convert HTML to PDF.

Products that transform PDF back to PS

The Glyph & Cog, LLC xpdf includes a pdftops utility.
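A minimal sketch of scripting pdftops from Python follows. It assumes the xpdf pdftops executable is on the PATH and uses its -f and -l options to select the first and last pages; the function names are hypothetical.

```python
import subprocess


def pdftops_command(pdf_path, ps_path, first_page=None, last_page=None):
    """Build a pdftops command line, optionally restricted to a page range."""
    cmd = ["pdftops"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    return cmd + [pdf_path, ps_path]


def pdf_to_ps(pdf_path, ps_path, first_page=None, last_page=None):
    """Run pdftops and raise CalledProcessError if it fails."""
    subprocess.run(pdftops_command(pdf_path, ps_path, first_page, last_page), check=True)
```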

Products that transform PS to PDF

Most of the world counts on ps2pdf.

The only independent converter I've found so far is PStill.

Products that validate PDF or PS

Automation

I often field questions such as, "I need to programmatically convert Office files to PDF. Is that possible / easy? How is that done?" I'll start with a few personal comments.

Adobe certainly wants people to think of it as the vendor-of-preference for all such needs. I respect Adobe for their business success and technical achievements. My experience as a front-line customer of theirs is ... mixed. My first instinct is to look for alternatives.

The dominant producers of PDF documents in the current market are Acrobat and Word. I suspect someone has reasonably accurate measurements of the share each holds; my rough impression is that the latter dominates. It certainly is feasible to automate Word in principle. While most Word scripters use VBA, I rely most on Tcl or Python ... There should be no effective barriers to full automation using Word's built-in facilities.

Word, however, emits bad PDF, and is often slow and unreliable, at least for the tasks that matter to me. Adobe frustrates me; I have a terrible history of trying to find out even the simplest product information from the company. When I want "industrial-strength" automation, I turn to Antiword or OpenOffice. The latter produces higher-quality PDF than Word, and is more open about its scripting capabilities, at least on an ideological level.
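To make the OpenOffice route concrete, here is a hedged Python sketch of a batch conversion. It is an assumption, not a description of my own setup: it presumes a LibreOffice-style soffice binary that supports the --headless, --convert-to, and --outdir options, which older OpenOffice releases may lack or may spell differently. The function names and paths are hypothetical.

```python
import subprocess
from pathlib import Path


def soffice_pdf_command(source, outdir, soffice="soffice"):
    """Build a headless conversion command (assumes LibreOffice-style options)."""
    return [
        soffice, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(source),
    ]


def expected_pdf_path(source, outdir):
    """Return where the converter is expected to write: same stem, .pdf suffix."""
    return Path(outdir) / (Path(source).stem + ".pdf")


def convert_to_pdf(source, outdir, soffice="soffice"):
    """Convert one office document to PDF and return the expected output path."""
    subprocess.run(soffice_pdf_command(source, outdir, soffice), check=True)
    return expected_pdf_path(source, outdir)
```

Looping convert_to_pdf over a directory of .doc files gives a simple unattended pipeline. Since the exit status alone may not prove that a file was written, it is prudent to check that the returned path exists.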

For special purposes, I've built even more involved "production lines" involving intermediate steps with PS, TeX, and other formats and technologies.


Miscellaneous PDF Products

PDF Writer Pro installs itself as a Windows printer driver which gives Windows applications the ability to write-to-PDF without Acrobat.

Enfocus Pitstop is a PDF preflight and editing package for the print industry.

PDF Crystal ...

[Explain capabilities and applicability of pdflatex, pdfpages ...]

Storypad ...


Cameron Laird's personal notes on PDF conversion utilities/claird@phaseit.net