The focus of this page ('anyone think I should re-do it as a Wiki?) is on the products available to convert to and from PDF images. IDR Solutions explains the challenge.
Ghostscript/Ghostview answers many questions, at least partially.
David Boddie's pdftools and David Leonard's PDFFile provide interesting Python-coded raw materials for those unafraid of dirtying their hands with programming. Early in 2005, one appreciated correspondent wrote me that the latter "handles things like decryption better." From what I can tell, PDFFile and python-pdftools do not write; they only read.
My clients often need to build reports which simply sequence existing .ps source. That's a far bigger undertaking than you might think, as Matthew Skala explains (in fact, I disagree with a few of his details, but he certainly gets the frustration right). Here are a few of the alternatives with which I've spent time:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
   -sOutputFile=result.pdf input1.pdf input2.pdf
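For those who would rather drive Ghostscript from a script than type the command by hand, here is a minimal sketch in Python. It only assembles the same flags shown in the command above and hands them to `gs`. The function names `gs_merge_command` and `gs_merge` are my own inventions for illustration, and the sketch assumes a `gs` executable is on the PATH.

```python
import subprocess
from typing import List, Sequence


def gs_merge_command(inputs: Sequence[str], output: str) -> List[str]:
    """Build the Ghostscript argument list for concatenating PDFs.

    Hypothetical helper: it reproduces the flags from the command line
    above and adds nothing to them.
    """
    if not inputs:
        raise ValueError("need at least one input PDF")
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output}",
        *inputs,  # inputs are concatenated in the order given
    ]


def gs_merge(inputs: Sequence[str], output: str) -> None:
    """Run Ghostscript; raises CalledProcessError on a non-zero exit."""
    subprocess.run(gs_merge_command(inputs, output), check=True)
```

Passing the arguments as a list, rather than as one shell string, sidesteps quoting trouble with file names that contain spaces.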
Java-based iText is a very widely-used library for PDF management [Explain mailing list, 1t3xt, and such.] Be aware that, as maintainer Paulo Soares has written [find reference in difficult mailing list], "If you're using it in an intranet you don't have to do anything. If you're exposing the service to the exterior either you provide the source code of your application or you buy a commercial license." He was writing about iText 5. Earlier releases could be freely embedded in Web applications.
The iText creators (hope to?) receive significant income from the book. While they generously make a wealth of information available on-line, I don't find it organized for my convenience. Among the highlights are:
pdfjoin, a member of the TeX-based pdfjam suite.
pdfjoin handles instances that cause pdftk, pyPdf, and iText to stumble. Phaseit is likely to continue to invest in at least a couple of these different open-source projects.
pdftk A=in1.pdf B=in2.pdf cat B1 A1-7even A1-7odd \
      output out1.pdf
pdftk works well, in my limited testing. It does not bookmark. It does manage background watermarks and foreground stamps. Bruno Lowagie tells me that, while Sid Steward has left computing for a family business, iText Software Corporation "has plans to set up support for PdfTk". Here, incidentally, is an interview with Bruno.
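The `cat` operands in the pdftk command above are compact enough to puzzle a newcomer, so here is a small Python sketch that spells out which pages they select and in what order. It is a simplified model of the notation, not pdftk's own parser: it understands only a handle, a start page, an optional end page, and an optional `even` or `odd` qualifier. The names `expand_cat_range` and `expand_cat` are hypothetical.

```python
import re
from typing import List, Tuple

# Simplified grammar (an assumption of this sketch):
#   HANDLE START[-END][even|odd], for example "A1-7even" or "B1".
_RANGE = re.compile(r"^([A-Z]+)(\d+)(?:-(\d+))?(even|odd)?$")


def expand_cat_range(spec: str) -> List[Tuple[str, int]]:
    """Expand one page range into (handle, page) pairs."""
    match = _RANGE.match(spec)
    if match is None:
        raise ValueError(f"unsupported page range: {spec!r}")
    handle, start, end, parity = match.groups()
    first = int(start)
    last = int(end) if end else first
    if first > last:
        # Descending ranges are left out of this sketch on purpose.
        raise ValueError(f"descending range not handled here: {spec!r}")
    pages = list(range(first, last + 1))
    if parity == "even":
        pages = [p for p in pages if p % 2 == 0]
    elif parity == "odd":
        pages = [p for p in pages if p % 2 == 1]
    return [(handle, p) for p in pages]


def expand_cat(specs: List[str]) -> List[Tuple[str, int]]:
    """Expand a whole operand list, preserving the order given."""
    result: List[Tuple[str, int]] = []
    for spec in specs:
        result.extend(expand_cat_range(spec))
    return result
```

Under these assumptions, `expand_cat(["B1", "A1-7even", "A1-7odd"])` yields page 1 of B, then pages 2, 4, 6 of A, then pages 1, 3, 5, 7 of A.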
Sometimes it's necessary to decrypt a PDF instance. qpdf is an example of a utility that helps.
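As a rough illustration, qpdf can be asked to write an unencrypted copy when given the password. The Python sketch below only builds that command line; `qpdf_decrypt_command` is a hypothetical helper name, the file names are placeholders, and the sketch assumes a `qpdf` executable is on the PATH.

```python
import subprocess
from typing import List


def qpdf_decrypt_command(source: str, target: str, password: str = "") -> List[str]:
    """Build a qpdf command that writes a decrypted copy of `source`.

    Hypothetical helper. An empty password covers files that open
    without one but still carry restrictions.
    """
    return ["qpdf", f"--password={password}", "--decrypt", source, target]


def qpdf_decrypt(source: str, target: str, password: str = "") -> None:
    """Run qpdf; raises CalledProcessError on a non-zero exit."""
    subprocess.run(qpdf_decrypt_command(source, target, password), check=True)
```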
At least, that's my usual first response, although, as 2004 begins, a couple of products are making me soften that stance. I understand all the situations that make text-extraction appear to be desirable; I've lived through most of them myself. As several sages have counseled, however, from a programmatic standpoint, "think of PDF as paper", by which they mean you could use scissors and glue on it, but there's almost certainly a better way. Almost always, you're--we're--better off going upstream to the data where the PDFs originated. I'm happy to help analyze specific situations on a consulting basis to determine whether there's an appropriate alternative to text-extraction, and also to help your organization implement the text-extraction method that's best for it. For more on the subject, and especially the possibilities for tabulated data, see this page focused exclusively on content extraction.
If you insist on extracting text from PDF, and choose not to engage our consultancy, you're likely to find your answer from the following list. This list remains partial; you're welcome to write me to ask that I unpack more of my notes, if you have specific requirements none of these meet.
pdftotext.exe. [doesn't handle compression?]
I've exhorted developers often in my more formal publications not to retrieve text from PDF; a recent example was "Friends don't let friends ...", in Smart Development.
The most common legitimate reason to render PDF to text is in combination with some sort of search; that's certainly the application of this sort I most often automate. Search and "content management" specialists are generally aware of the issues involved, and often offer their own PDF extractors as plug-ins or add-ons.
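To make the search case concrete, here is a minimal Python sketch of the usual pattern: extract text, then build a word-to-page index. It assumes the `pdftotext` command-line tool is installed and relies on its habit of separating pages with form-feed characters. The names `extract_text` and `index_pages` are hypothetical, and a real search product would do far more (stemming, ranking, handling of scanned pages).

```python
import re
import subprocess
from collections import defaultdict
from typing import Dict, Set


def extract_text(pdf_path: str) -> str:
    """Return the text pdftotext writes to stdout ("-" as the output file)."""
    done = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True, capture_output=True, encoding="utf-8",
    )
    return done.stdout


def index_pages(text: str) -> Dict[str, Set[int]]:
    """Map each lowercased word to the set of 1-based pages containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    pages = text.split("\f")
    # A form feed after the last page leaves an empty tail; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    for number, page in enumerate(pages, start=1):
        for word in re.findall(r"\w+", page.lower()):
            index[word].add(number)
    return index
```

Typical use would be `index_pages(extract_text("report.pdf"))`, where `report.pdf` is a placeholder file name. Keeping extraction separate from indexing means the extractor can be swapped for another without touching the index code.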
For immediate results, Zamzar is a Web application that quickly converts one or a small number of PDF-defined pages [also mention YouConvertIt, Neevia]. Even quicker, for those running Mac OS, is simply to open Preview and SaveAs JPG.
An abundance of installable desktop applications include the capability to visualize a PDF page as, for example, JPG. Among them are:
Finally, for automation, ...
In 2011, I moved the contents of this section to a new page.
"PDF mollifiers fill crucial role" tells a bit more about what I think on this subject.
Here is the source mentioned in a Smart Development post called "PDF pagination only takes a few lines". Phaseit, Inc. holds the copyright to this source. Use as you wish. If you make weapons with this code, are ill-humored, claim you originated it yourself, or think a court will support a lawsuit against Phaseit ... well, it's your soul that suffers.
In 2010, I'm testing APDF Number.
Adobe certainly wants people--especially those who control budget decisions--to think of it as the vendor-of-preference for all such needs. I respect Adobe for their business success and technical achievements. My experience as a front-line customer of theirs is ... mixed. My first instinct is to look for alternatives.
The dominant producers of PDF documents in the current market are Acrobat and Word. I suspect someone has reasonably accurate measurements of the share each holds; my rough impression is that the latter dominates. It certainly is feasible to automate Word in principle. While most Word scripters use VBA, I rely most on Tcl or Python ... There should be no effective barriers to full automation using Word's built-in facilities.
Word, however, emits bad PDF, and is often slow and unreliable, at least for the tasks that matter to me. Adobe frustrates me; I have a terrible history of trying to find out the simplest product information from the company. When I want "industrial-strength" automation, I turn to Antiword or OpenOffice. The latter produces higher-quality PDF than Word, and is more open about its scripting capabilities, at least on an ideological level.
For special purposes, I've built even more involved "production lines" involving intermediate steps with PS, TeX, and other formats and technologies.
PDF Writer Pro installs itself as a Windows printer driver which gives Windows applications the ability to write-to-PDF without Acrobat.
Enfocus Pitstop is a PDF preflight and editing package for the print industry.
PDF Crystal ...
[Explain capabilities and applicability of pdflatex, pdfpages ...]
[I need to explain ReportLab, html2ps, ...]