Read, merge, split PDF in Python

GemBox.Pdf is a .NET library that enables you to process PDF files from any .NET application. It is also COM-accessible, so you can use it from Python as well.

System Requirements

To use GemBox.Pdf in Python, you'll need to:

  1. Download and install GemBox.Pdf.
  2. Expose GemBox.Pdf to COM Interop with the RegAsm.exe tool:
    :: Add GemBox.Pdf to COM registry for x86 (32-bit) applications.
    C:\Windows\Microsoft.NET\Framework\v4.0.30319\RegAsm.exe [path to installed GemBox.Pdf.dll]
    
    :: Add GemBox.Pdf to COM registry for x64 (64-bit) applications.
    C:\Windows\Microsoft.NET\Framework64\v4.0.30319\RegAsm.exe [path to installed GemBox.Pdf.dll]
  3. Install the Python for Windows extensions (pywin32):
    :: Install the Python for Windows extensions.
    pip install pywin32
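
Once these steps are done, a short script can confirm that Python can reach the library before you run the full example. The following is a minimal sketch, not part of GemBox.Pdf: the function names `python_bitness` and `can_create_com_object` and the `dispatch` parameter are made up for illustration, and the only name taken from this page is the `GemBox.Pdf.ComHelper` ProgID. The interpreter's bitness matters because a 64-bit Python needs the x64 registration and a 32-bit Python needs the x86 one.

```python
import importlib.util
import struct

PROG_ID = "GemBox.Pdf.ComHelper"  # ProgID used by the example on this page


def python_bitness():
    """Return 32 or 64, depending on the running Python interpreter."""
    return struct.calcsize("P") * 8


def can_create_com_object(prog_id, dispatch=None):
    """Try to create a COM object; return (ok, reason).

    `dispatch` is a hypothetical injection point for illustration; when it is
    omitted, win32com.client.Dispatch from pywin32 is used.
    """
    if dispatch is None:
        if importlib.util.find_spec("win32com") is None:
            return False, "pywin32 is not installed (pip install pywin32)"
        import win32com.client
        dispatch = win32com.client.Dispatch
    try:
        dispatch(prog_id)
    except Exception as error:  # pywin32 reports COM failures as exceptions
        return False, "could not create " + prog_id + ": " + str(error)
    return True, ""


if __name__ == "__main__":
    print("Python is", python_bitness(), "bit")
    ok, reason = can_create_com_object(PROG_ID)
    print("GemBox.Pdf is reachable through COM." if ok else reason)
```

If the object cannot be created, compare the reported bitness with the RegAsm.exe command you ran in step 2.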

Working with PDF files in Python

When working with PDF files, we usually need to perform some actions, such as merging documents on similar topics or splitting a file to extract a particular page from it. See the following example to learn how to read a PDF file, split it into multiple files, and merge PDF files using Python.

Merging PDF files into one and splitting PDF pages into multiple PDF files in Python
Screenshot of merged PDF files and split PDF pages
import os
import win32com.client as COM

# Create ComHelper object.
comHelper = COM.Dispatch("GemBox.Pdf.ComHelper")
# If using the Professional version, put your serial key below.
comHelper.ComSetLicense("FREE-LIMITED-KEY")

fileNames = ["\\MergeFile01.pdf", "\\MergeFile02.pdf", "\\MergeFile03.pdf"]

################
### Read PDF ###
################

# Load PDF file.
document1 = comHelper.Load(os.getcwd() + fileNames[0])
pages1 = document1.Pages

# Read text content from each PDF page.
for i1 in range(pages1.Count):
    page = pages1.Item(i1)
    print(page.Content.ToString() + "\n")

document1.Dispose()

#################
### Merge PDF ###
#################

# Create PdfDocument object.
document2 = COM.Dispatch("GemBox.Pdf.PdfDocument")

# Merge multiple PDF files into a single PDF file.
for fileName in fileNames:
    sourceDocument = comHelper.Load(os.getcwd() + fileName)
    sourcePages = sourceDocument.Pages

    for i2 in range(sourcePages.Count):
        document2.Pages.AddClone(sourcePages.Item(i2))

    sourceDocument.Dispose()

comHelper.Save(document2, os.getcwd() + "\\Merge Files.pdf")
document2.Dispose()

#################
### Split PDF ###
#################

# Load PDF file.
document3 = comHelper.Load(os.getcwd() + "\\Merge Files.pdf")
pages3 = document3.Pages

# Split a single PDF file into multiple PDF files.
for i3 in range(pages3.Count):
    destinationDocument = COM.Dispatch("GemBox.Pdf.PdfDocument")
    destinationDocument.Pages.AddClone(pages3.Item(i3))
    
    comHelper.Save(destinationDocument, os.getcwd() + "\\Page" + str(i3) + ".pdf")
    destinationDocument.Dispose()

document3.Dispose()
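
The example above repeats two patterns: looping over a COM collection by index and calling Dispose by hand. As a hedged sketch, the two helpers below wrap those patterns. `iter_items` and `disposing` are hypothetical names, not GemBox.Pdf members, and they assume only what the example already uses: collections with `Count` and `Item(index)`, and objects with `Dispose()`.

```python
from contextlib import contextmanager


def iter_items(collection):
    """Yield each element of a COM collection that exposes Count and Item(index)."""
    for index in range(collection.Count):
        yield collection.Item(index)


@contextmanager
def disposing(com_object):
    """Call Dispose() on the object when the with-block ends, even on errors."""
    try:
        yield com_object
    finally:
        com_object.Dispose()


# Possible use with the objects from the example above (not run here):
#
#     with disposing(comHelper.Load(path)) as document:
#         for page in iter_items(document.Pages):
#             print(page.Content.ToString())
```

With these helpers, an error raised while reading a page still releases the document.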

Wrapper Library

Not all members of GemBox.Pdf are COM-accessible, because COM Interop does not support features such as static and overloaded methods. For that reason, you can use the ComHelper class, which provides alternatives for some members that cannot be called through COM Interop.

However, if you need to use many GemBox.Pdf members from Python, we recommend creating a .NET wrapper library. Your wrapper library should do all the work and expose a minimal set of classes and methods to the unmanaged code.

This approach lets you take advantage of GemBox.Pdf's full capabilities, avoids the COM limitations, and improves performance by reducing the number of COM Callable Wrappers created at runtime.
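
From the Python side, such a wrapper could reduce the whole merge to a single COM call. The sketch below is purely illustrative: the ProgID `MyCompany.PdfTools` and its `MergeFiles` method are hypothetical names for a wrapper you would write and register yourself; they are not part of GemBox.Pdf. It assumes the wrapper accepts the source paths as one `|`-separated string, so that only plain strings cross the COM boundary.

```python
def merge_with_wrapper(wrapper, source_paths, destination_path):
    """Merge PDF files with one call to a hypothetical .NET wrapper object.

    `wrapper` is assumed to expose MergeFiles(joinedPaths, destinationPath).
    """
    source_paths = list(source_paths)
    if not source_paths:
        raise ValueError("at least one source file is required")
    # '|' cannot occur in Windows file names, so it is a safe separator.
    wrapper.MergeFiles("|".join(source_paths), destination_path)


# Possible use on a machine where the hypothetical wrapper is registered:
#
#     import win32com.client as COM
#     wrapper = COM.Dispatch("MyCompany.PdfTools")
#     merge_with_wrapper(wrapper, ["C:\\in\\a.pdf", "C:\\in\\b.pdf"], "C:\\out\\merged.pdf")
```

In this design the loop over documents and pages would run inside .NET, so Python holds one COM object instead of one per document and page.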

Next steps

GemBox.Pdf is a .NET component that enables developers to read, merge and split PDF files or execute low-level object manipulations from .NET applications in a simple and efficient way.
