Coming Up for Air

Converting Many Images to One PDF

Wednesday, Sep 5, 2012

Jason Lee 2012-09-05

I recently had the need to convert several scanned images into one multi-page PDF. While there are probably tools to help do this manually, I knew that there was a good chance I'd have to do something like this again, quite possibly with a large number of images, so I did what any good geek would do: I scripted it. In this entry, I'll show how I went about that.

For starters, let's take a look at the very small, simple Python script:

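#!/usr/bin/python
# Python 2 script: merge several image files into one multi-page PDF.

import os
import sys

import PythonMagick
from pyPdf import PdfFileReader, PdfFileWriter

if not (len(sys.argv) > 2 and sys.argv[1].endswith('.pdf')):
    print "usage: images_to_pdf.py <finalname.pdf> <image1> ... <imageN>"
else:
    final_name = sys.argv[1]
    merged = PdfFileWriter()

    for image_name in sys.argv[2:]:
        print "Processing %s..." % image_name
        # Writing to a .pdf file name converts the image to a PDF.
        img = PythonMagick.Image()
        img.read(image_name)
        img.write('temp.pdf')
        # Open in binary mode. The handle is deliberately left open, since
        # pyPdf may still read page data when the merged file is written.
        pdf = PdfFileReader(open('temp.pdf', 'rb'))
        for page in pdf.pages:
            merged.addPage(page)
        os.remove('temp.pdf')

    # Write the merged document once all pages have been added.
    with open(final_name, mode='wb') as merged_file:
        merged.write(merged_file)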

There's not a lot to it, thanks in large part to PythonMagick and pyPdf. This script takes at least two parameters: the final name of the PDF, and at least one image file. The bulk of the workflow is this:

  • Create a PdfFileWriter object. This handles the heavy lifting in actually writing the PDF.

  • Iterate over the image file names given
    • Create an Image object and read the image source into it

    • Write the image to a temporary PDF file. This implicitly converts the image to a PDF.

    • Read the temporary PDF into memory via PdfFileReader

    • For each page in the temporary PDF (which should be exactly 1), add it to the real, final PDF

    • Delete the temporary PDF

  • Write the newly constructed PDF to disk and exit.
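
One fragile spot in the steps above is the hard-coded temp.pdf name: two runs in the same directory would overwrite each other's temporary file. Here is a minimal sketch of the same convert, read, delete loop using Python's standard tempfile module. The names temporary_pdf and process_images are made up for illustration, and the convert and add_pages parameters are hypothetical stand-ins for the PythonMagick and pyPdf calls (they are not part of either library), passed in so the loop can be tried without them.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_pdf():
    """Yield the path of a uniquely named temporary PDF, then delete it."""
    # mkstemp picks a unique name, so concurrent runs cannot collide.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)  # only the path is needed; the converter reopens the file
    try:
        yield path
    finally:
        # Clean up even if the conversion or merge step raised an error.
        if os.path.exists(path):
            os.remove(path)


def process_images(image_paths, convert, add_pages):
    """Run the per-image loop from the list above.

    convert(image_path, pdf_path) and add_pages(pdf_path) are hypothetical
    callables standing in for the PythonMagick and pyPdf steps.
    """
    for image_path in image_paths:
        with temporary_pdf() as pdf_path:
            convert(image_path, pdf_path)
            add_pages(pdf_path)
```

In the real script, convert would wrap the Image read/write pair and add_pages would open the temporary file and call addPage for each page. As in the original, the temporary file is removed while a reader may still hold it open, which is fine on Linux but may fail on Windows.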

It's very simple, and pretty dumb (I added only enough error checking to make it work for me ;), and it may be a suboptimal use of the APIs, but it works pretty well for me. Hopefully, it will help someone else out.
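
For anyone who wants slightly more error checking than the one-line usage test in the script, here is a rough sketch of how the argument handling could be pulled into a function. The parse_args name and the check for missing image files are my own additions for illustration, not something the original script does.

```python
import os


def parse_args(argv):
    """Return (final_name, image_paths), or raise ValueError with a message.

    Hypothetical helper: it mirrors the script's checks (at least two
    parameters, output name ending in .pdf) and adds a check that each
    image file exists.
    """
    if len(argv) < 3:
        raise ValueError(
            "usage: images_to_pdf.py <finalname.pdf> <image1> ... <imageN>")
    final_name, image_paths = argv[1], argv[2:]
    if not final_name.endswith(".pdf"):
        raise ValueError("output name must end in .pdf: %s" % final_name)
    # Added check (not in the original): fail early on missing inputs.
    missing = [p for p in image_paths if not os.path.isfile(p)]
    if missing:
        raise ValueError("image file(s) not found: %s" % ", ".join(missing))
    return final_name, image_paths
```

The script could then call parse_args(sys.argv) at the top, print the message from any ValueError, and exit before touching PythonMagick at all.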