Cameron Laird's personal notes on generation of PDF

Categorize PDF producers in this way:

[Explain Vasudev Ram's xtopdf and StdinToPDF.py.]

Language bindings

[Explain iText, ReportLab, jsPDF (now HTML5-equipped?), ...]

[Refer to Vasudev Ram's CreatingPDF Wiki, the severely incomplete Wikipedia reference page, ...]

Converters

How can one generate PDF that meets particular requirements? One strategy is to create something that meets the (visual) requirements in another format, then transform it to PDF. This has been quite effective in our commercial practice. We've automated the production of hundreds of thousands of PDFs, for instance, by first creating (roughly) corresponding HTML source, then transforming that source to PDF. We've also had occasion to employ other display formats, including TeX, PostScript, Sphinx, DOC, ...
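As a minimal sketch of the first half of that strategy, the Python fragment below fills an HTML template from one record; the result can then be handed to any HTML->PDF converter. The template, the field names, and the `render_html` helper are hypothetical illustrations, not a description of our production code.

```python
import html
from string import Template

# Hypothetical template; a real one would carry the layout the PDF must show.
PAGE = Template("""<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<p>$body</p>
</body>
</html>
""")


def render_html(title, body):
    """Return an HTML page for one record, escaping markup in the data."""
    return PAGE.substitute(title=html.escape(title), body=html.escape(body))


if __name__ == "__main__":
    # Print the intermediate HTML; a converter turns it into PDF afterwards.
    print(render_html("Invoice 0001", "Widgets & gadgets"))
```

Escaping the data before substitution keeps stray `<` or `&` characters in a record from breaking the page that the converter later sees.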

[Point to published articles and commercial services related to this subject. Explain how this approach can fit a particular organization's licensing and skills MUCH better than "native" PDF.] [Write about TeX and DOC.]

DOC

For the mass of computer users, the "natural" way to produce a PDF instance is probably something like this:

  1. Edit a document in Microsoft Word; then
  2. Somehow render the document to PDF.

Consider the possibilities automation introduces: with a way to convert DOC->PDF, existing organizational corpora suddenly become available as PDF, with all the usual advantages of the latter (licensing, read-only, device-independence, ...) [further document]. With a means to generate DOC automatically (as Phaseit also does, though far less often than the other formats described here), we have a chain that produces good-looking PDF by widely familiar means. Maintenance is correspondingly inexpensive.

For the DOC->PDF step, there are many possibilities, including:

HTML

For years, I used Jan Kärrman's Perl-coded html2ps utility in a chain with Ghostscript (for ps2pdf) in production situations. In March 2004, however, I began to rely instead on the GPLed HTMLDOC, advertised as "a program that generates indexed HTML, PostScript, and PDF files from HTML 'source' ..." HTMLDOC has produced usable output from everything I've given it. Ghostscript, on the other hand, fails on certain output of html2ps. While I'm good enough with PostScript, Perl, and C to tackle the errors I've encountered so far, I currently find it more productive to rely on HTMLDOC. If someone from the html2ps or Ghostscript (or PStill, for that matter) projects wants reproducible symptoms and/or patches, I'll happily oblige.

Yes, HTMLDOC is both commercial and free. Note, by the way, that release 1.9 of HTMLDOC will be the first to support CSS. [Explain technical advantages and disadvantages of HTMLDOC.]
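To make the difference between the two routes concrete, here is a rough Python sketch that builds the command lines for the html2ps-plus-ps2pdf chain and for HTMLDOC. The helper names (`html2ps_chain`, `htmldoc_command`, `run`) are made up for illustration; the sketch assumes the three programs are on the PATH and uses only their basic invocations.

```python
import shutil
import subprocess


def html2ps_chain(html_path, ps_path, pdf_path):
    """Stages for the two-step route: html2ps, then Ghostscript's ps2pdf."""
    # html2ps writes PostScript to standard output, so stage one is paired
    # with the file that should capture that output.
    return [(["html2ps", html_path], ps_path),
            (["ps2pdf", ps_path, pdf_path], None)]


def htmldoc_command(html_path, pdf_path):
    """Stage for the one-step route through HTMLDOC."""
    # --webpage treats the input as a plain page rather than a structured
    # book; -t selects the output format and -f names the output file.
    return [(["htmldoc", "--webpage", "-t", "pdf", "-f", pdf_path,
              html_path], None)]


def run(stages):
    """Run each stage in order, raising on the first failure."""
    for argv, stdout_path in stages:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(argv[0] + " is not installed")
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling `run(htmldoc_command("report.html", "report.pdf"))` involves one program and no intermediate file, which is part of why the single-stage route is easier to keep running.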

In 2007, I also began to use iText for HTML->PDF transformations. HTMLWorker is the natural interface. Early in 2011, the HTMLWorker team released a new implementation. Even this latest HTMLWorker, though, does not support <form> or several other standard HTML elements.

In 2010, I received a recommendation for wkhtmltopdf. For applications that involve CSS and/or put a premium on duplicating what end users see in a Web browser, wkhtmltopdf is now our strongly preferred solution.
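A batch run with wkhtmltopdf might look roughly like the following Python sketch. The function names and the directory layout are assumptions made for illustration; the command line itself uses only wkhtmltopdf's `--quiet` and `--page-size` options followed by the input and output names.

```python
import subprocess
from pathlib import Path


def wkhtmltopdf_argv(source, pdf_path, page_size="Letter"):
    """Build a wkhtmltopdf command line; source may be a file or a URL."""
    return ["wkhtmltopdf", "--quiet", "--page-size", page_size,
            str(source), str(pdf_path)]


def convert_all(html_dir, pdf_dir):
    """Convert every *.html file in html_dir; return the inputs that failed."""
    pdf_dir = Path(pdf_dir)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for page in sorted(Path(html_dir).glob("*.html")):
        target = pdf_dir / (page.stem + ".pdf")
        # A non-zero exit status marks this page as failed; the loop goes on
        # so that one bad page does not stop the whole batch.
        result = subprocess.run(wkhtmltopdf_argv(page, target))
        if result.returncode != 0:
            failed.append(page)
    return failed
```

If wkhtmltopdf is not installed, `subprocess.run` raises `FileNotFoundError` on the first page rather than reporting a failed conversion.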

The other HTML->PS and HTML->PDF products I've found apparently either don't automate well or are available only for Windows.

While I still have no experience with Win*-based activePDF WebGrabber, its function apparently is exactly to convert HTML to PDF.

PostScript

I've written several PostScript applications. With those in hand, of course, it's natural to render output as PDF.

Most of the world counts on ps2pdf. The only independent converter I've found so far is PStill.

TeX

Sphinx

On MacOS, Sphinx uses MacTeX to render PDF--all 1.5 GB of MacTeX. This clearly carries a lot of unnecessary bits. Apparently no one is much motivated to change this.

rst2pdf relies on ReportLab, and apparently can also render Sphinx documents adequately.


Cameron Laird's personal notes on generation of PDF/claird@phaseit.net