Cameron Laird's personal notes on PDF

"PDF" is Adobe's acronym for "Portable Document Format", a proprietary specification for a device- and platform-independent display format. It's realized as a sort of wrapped and compressed PostScript. Laurens Leurs lucidly describes both "An Introduction to PDF" and "The history of PDF" from a "prepress" perspective. The two big commercial sites for PDF information, aside from Adobe, are PlanetPDF and PDFZone.

In December 2001, I published a breezy introduction to no-cost PDF resources for my "Open Sources" column. I also wrote "Yes You Can" (August 2002), "Low-cost PDF" (April 2003), "PDF for C and C++ Developers" (October 2003), and ... For more information on the products described there, start with the home pages of PDFlib, PJ, and ReportLab. I'll probably write more on ReportLab programming and business strategy throughout 2002, perhaps beginning with a piece on PDF security; write me if there's a particular aspect you want me to cover. Note that the Ohio Department of Transportation's open-source JavaPDF is another product worth considering along with PDFlib, PJ, ReportLab, and all of CPAN's PDF directory.

I recommend reading "Kyler Laird's PDF utilities" both for the usefulness of the tools and hyperlinks available there, and also for the correct engineering commentary. Dave Touretzky maintains a "Gallery of Adobe Remedies" with more comprehensive information on PDF security, including a pointer to a Perl script which decrypts PDF.

Etymon™'s PJ class library coded in Java includes a command-line utility, pjscript.

Acquaintances tell me good things about Xpdf's ability to extract (plain)text content from PDF sources. Xpdf is an X-oriented PDF viewer.
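
For readers who want to try that extraction from a script, here is a minimal sketch. It assumes the pdftotext command-line program that ships with the Xpdf distribution is installed and on the PATH; the function names pdftotext_command and extract_text are hypothetical, chosen only for this illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, text_path, keep_layout=False):
    """Build the argument list for Xpdf's pdftotext tool (nothing is run here)."""
    argv = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        argv.append("-layout")
    argv.extend([pdf_path, text_path])
    return argv


def extract_text(pdf_path, text_path, keep_layout=False):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (assumed to come with Xpdf)")
    subprocess.run(pdftotext_command(pdf_path, text_path, keep_layout), check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before anything touches the file system.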

Addison-Wesley published the PDF Reference on dead trees. It's also available online, as are the draft specification for PDF 1.5 and ... a different specification.

[Describe other tools.]

[Editorialize on PDF role.]

"PDF: Unfit for Human Consumption" is Jakob Nielsen's hysterical--that is, effectively publicized--mid-July 2003 attack on the "usability" of the format. [explain errors, obscurity of correct observations]

Freeware GhostWord plugs into Word, PowerPoint, and Excel, and automates production of .ps, and, from there, .pdf. Thanks to Dr. Gregory Guthrie for tipping me off to GhostWord.

ReportLab programming

Along with the references above, "Yes You Can" and "PDF for the server" (but see important import_HTML note below) touch on ReportLab programming. Readers asked for example usages. Here are a few:

copyPages

Is copyPages still not in the standard ReportLab documentation? In which public release did it first appear? As October 2002 begins, it looks as though it's only in the for-fee library, but that's not true ... [collect details, explain.] In any case, here's how you can append one PDF source to another, while preserving "bookmarks":
   from pageCatcher import copyPages
   from reportlab.pdfgen import canvas 

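   def makeAppendedResult(result, first_source, second_source):
      # Copy every page of both sources, in order, onto one canvas.
      c = canvas.Canvas(result)
      copyPages(first_source, c)
      copyPages(second_source, c)
      # Ask the viewer to open with the outline (bookmark) pane visible.
      c.showOutline()
      c.save()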
       

import_HTML

Ugh. My apologies, folks; in the article titled "PDF for the Server" I identified import_HTML as part of ReportLab's library. This is simply false, and I'll make a point of correcting it in a future column.

The import_HTML I use is this:

# In response to a correspondent's comment, I replied:
#   "Bleah; ignore the Python.  I'll comment it to make this
#    clear:  the point is just that HTML->PDF is achieved as
#    HTML->PS->PDF, the second step is canonical, and the
#    first is done with a specific command-line tool."

# Copyright Kyler Laird 2001.
# Freely redistributable.
#

# Import from HTML.
def import_HTML(self, html, color=0, style=None, landscape=0, number=0):
    infile = self._write_string_to_tmpfile(html, ext='HTML')
    self.outfile = self._mktemp('ps')

    options = []

    if number:
        options.append('--number')
        options.append('--startno %d' % number)

    if landscape:
        options.append('--landscape')

    if color:
        options.append('--colour')

    if style:
        stylefile = self._write_string_to_tmpfile(style, ext='style')
        # options.append('--style "%s"' % (style))
        options.append('-f "%s"' % (stylefile))

    command_string = "html2ps %s -o %s %s" % (
        ' '.join(options), self.outfile, infile)
    self._run(command_string)
    return 
    
There are several ways to render HTML as PS.
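
The HTML->PS->PDF route described in the comment above can be sketched without any of the surrounding class. This is only an illustration under stated assumptions: it uses the same html2ps options as the method above for the first step, and assumes Ghostscript's ps2pdf for the "canonical" second step. The function names (html2ps_command, ps2pdf_command, html_to_pdf_commands) are hypothetical, and the sketch only builds the two command lines rather than running them.

```python
import os


def html2ps_command(html_path, ps_path, number=0, landscape=False,
                    color=False, style_path=None):
    """Build the html2ps argument list, mirroring the options used above."""
    argv = ["html2ps"]
    if number:
        argv += ["--number", "--startno", str(number)]
    if landscape:
        argv.append("--landscape")
    if color:
        argv.append("--colour")
    if style_path:
        argv += ["-f", style_path]
    argv += ["-o", ps_path, html_path]
    return argv


def ps2pdf_command(ps_path, pdf_path):
    """Build the PS->PDF step; ps2pdf (from Ghostscript) is an assumption here."""
    return ["ps2pdf", ps_path, pdf_path]


def html_to_pdf_commands(html_path, pdf_path, **options):
    """Return both command lines, with the intermediate .ps next to the PDF."""
    ps_path = os.path.splitext(pdf_path)[0] + ".ps"
    return [
        html2ps_command(html_path, ps_path, **options),
        ps2pdf_command(ps_path, pdf_path),
    ]


if __name__ == "__main__":
    # Print the two steps; pass each list to subprocess.run(..., check=True) to execute.
    for argv in html_to_pdf_commands("page.html", "page.pdf", landscape=True):
        print(" ".join(argv))
```

Passing argument lists instead of one formatted string sidesteps shell quoting problems with file names that contain spaces.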

[Explain significance.] [Compare to htmldoc.]

      my_html_source = """
         <HTML>
         <HEAD><TITLE>%s</TITLE></HEAD>
         <BODY><H1>%s</H1>
         %s
         </BODY>
         </HTML>""" % (title, title, content)
      my_document.import_HTML(my_html_source)
   

Cameron Laird's personal notes on PDF/claird@phaseit.net