Python on File Format Blog

Working with PDF files in Python

Wed, 29 Jan 2025 00:00:00 +0000

Last Updated: 29 Jan, 2025

In this article, we will guide you on how to work with PDF files using Python. For this, we’ll utilize the pypdf library.

Using the pypdf library, we’ll demonstrate how to perform the following operations in Python:

Extracting text from PDFs
Rotating PDF pages
Merging multiple PDFs
Splitting PDFs into separate files
Adding watermarks to PDF pages

Note: Each section below stands on its own, so feel free to skip straight to the operation you need.

Sample Code

You can download all the sample code used in this article from the following link. It includes the code, input files, and output files.

Code Examples and Input Files for Working with PDF Files in Python

Install pypdf

To install pypdf, simply run the following command in your terminal or command prompt:

pip install pypdf

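Note: The package name is pypdf, written in lowercase. It is the successor to the older PyPDF2 package, so make sure you install pypdf and not PyPDF2.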

1. Extracting Text from a PDF File Using Python

Code Explanation

1. Creating a PDF Reader Object

reader = PdfReader(pdf_file)

PdfReader(pdf_file) loads the PDF file into a reader object.
This object allows access to the pages and their content.

2. Looping Through Pages

for page_number, page in enumerate(reader.pages, start=1):

reader.pages is a list-like sequence of the pages in the PDF.
enumerate(..., start=1) assigns a page number starting from 1.

3. Printing Extracted Text

    print(f"Page {page_number}:")
    print(page.extract_text())
    print("-" * 50)  # Separator for readability

page.extract_text() extracts text content from the current page.
The script prints the extracted text along with the page number.
"-" * 50 prints a separator line (--------------------------------------------------) for better readability.

Input PDF File Used in the Code

Input File: Download Link

Output of the Code

2. Rotating PDF Pages Using Python

Code Explanation

The code rotates the first page by 90° clockwise and saves the result as a new PDF, leaving all other pages unchanged.

1. Import Required Classes

from pypdf import PdfReader, PdfWriter

PdfReader: Reads the input PDF.
PdfWriter: Creates a new PDF with modifications.

2. Define Input and Output File Paths

input_pdf = "pdf-to-rotate/input.pdf"
output_pdf = "pdf-to-rotate/rotated_output.pdf"

The script reads from input.pdf and saves the modified file as rotated_output.pdf.

3. Read the PDF and Create a Writer Object

reader = PdfReader(input_pdf)
writer = PdfWriter()

reader loads the existing PDF.
writer is used to store the modified pages.

4. Rotate the First Page by 90 Degrees

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

Extracts page 1, rotates it 90 degrees, and adds it to the new PDF.

5. Add Remaining Pages Without Changes

for i in range(1, len(reader.pages)):
    writer.add_page(reader.pages[i])

Loops through the remaining pages and adds them as they are.

6. Save the New PDF

with open(output_pdf, "wb") as file:
    writer.write(file)

Opens rotated_output.pdf in write-binary mode and saves the new PDF.

7. Print Confirmation

print(f"Rotated page saved to {output_pdf}")

Displays a success message.

Input PDF Used in the Code and Its Rotated Output

Input PDF File: Download Link
Output Rotated PDF File: Download Link

3. Merging PDF Files Using Python

This Python script demonstrates how to merge multiple PDF files from a directory into a single PDF using the pypdf library.

Code Explanation

This script merges all PDF files found in the specified directory (pdfs-to-merge) into a single output file (merged_output.pdf).
It ensures the output directory exists and adds each PDF’s pages in alphabetical order of file name.
It saves the final merged file in the output-dir subdirectory of pdfs-to-merge.

Code Breakdown

1. Import Libraries

import os
from pypdf import PdfReader, PdfWriter

os: Used to interact with the file system, such as reading directories and managing file paths.
PdfReader: Reads the content of a PDF file.
PdfWriter: Creates and writes a new PDF file.

2. Define Directory and Output File

directory = "pdfs-to-merge"
output_file = "output-dir/merged_output.pdf"

directory: Specifies the folder where the PDF files are stored.
output_file: Defines the output path and name of the merged PDF.

3. Create Output Directory if It Doesn’t Exist

os.makedirs(os.path.join(directory, "output-dir"), exist_ok=True)

This ensures the output directory exists, and if it doesn’t, it creates it.

4. Create a PdfWriter Object

writer = PdfWriter()

writer is used to collect and combine all the pages from the PDFs.

5. Iterate Over All PDF Files in the Directory

for file_name in sorted(os.listdir(directory)):
    if file_name.endswith(".pdf"):
        file_path = os.path.join(directory, file_name)
        print(f"Adding: {file_name}")

This loop goes through all files in the specified directory, checking for files with the .pdf extension. It uses sorted() to process them in alphabetical order.

6. Read Each PDF and Append Pages to the Writer

        reader = PdfReader(file_path)
        writer.append(reader)

These two lines belong inside the if block from step 5. For each PDF, PdfReader reads the file, and writer.append() then adds all of its pages to writer.

7. Write the Merged PDF to an Output File

output_path = os.path.join(directory, output_file)
with open(output_path, "wb") as output_pdf:
    writer.write(output_pdf)

After collecting all pages, writer.write() writes the merged PDF to the specified output path.

8. Print Confirmation

print(f"Merged PDF saved as: {output_path}")

Prints a success message confirming the location of the saved merged PDF.

Input PDF Files Used in the Code and the Merged Output PDF

Input PDF Files: Download Link
Merged Output PDF: Download Link

4. Splitting a PDF Using Python

Code Explanation

The script for this section splits a PDF into separate pages using the pypdf library. It first ensures that the output directory exists, then reads the input PDF file. It then loops through each page, creates a new PdfWriter object for it, and saves the page as an individual PDF file. The output files are named sequentially (e.g., page_1.pdf, page_2.pdf) and stored in the output-dir folder. Finally, it prints a confirmation message for each created file and a notice when the process is complete.

Input PDF and Split Output Files

Input PDF File: Download Link
Split Output Files: Download Link

5. Adding a Watermark to a PDF Using Python

You can add a watermark to a PDF using the pypdf library by overlaying a watermark PDF onto an existing PDF. Make sure the watermark PDF has only one page so it applies correctly to each page of the main PDF.

Code Explanation

The script reads an input PDF and a one-page watermark PDF, overlays the watermark on each page of the input PDF, and saves the final watermarked PDF.

Code Breakdown

Here’s a brief explanation of each part:

1. Import Required Classes

from pypdf import PdfReader, PdfWriter

PdfReader is used to read existing PDFs.
PdfWriter is used to create and write a new PDF.

2. Define File Paths

input_pdf = "pdf-to-watermark/input.pdf"
watermark_pdf = "pdf-to-watermark/watermark.pdf"
output_pdf = "pdf-to-watermark/output_with_watermark.pdf"

input_pdf: The original PDF to which the watermark will be added.
watermark_pdf: A separate one-page PDF that serves as the watermark.
output_pdf: The output file that will contain the watermarked pages.

3. Read PDFs

reader = PdfReader(input_pdf)
watermark = PdfReader(watermark_pdf)

reader: Reads the input PDF.
watermark: Reads the watermark PDF.

4. Create a Writer Object

writer = PdfWriter()

This will be used to create the final watermarked PDF.

5. Extract Watermark Page

watermark_page = watermark.pages[0]

Assumes that the watermark PDF has only one page, which is used to overlay on all pages.

6. Loop Through Input PDF Pages & Merge Watermark

for page in reader.pages:
    # Merge the watermark with the current page
    page.merge_page(watermark_page)
    
    # Add the merged page to the writer
    writer.add_page(page)

Iterates through each page of input_pdf.
merge_page(watermark_page) overlays the watermark on top of the current page.
Adds the modified page to the writer.

7. Save the Watermarked PDF

with open(output_pdf, "wb") as output_file:
    writer.write(output_file)

Writes the modified pages into a new PDF file.

8. Print Confirmation

print(f"Watermarked PDF saved as: {output_pdf}")

Prints the output file path for confirmation.

Input PDF, Watermark PDF, and Output Watermarked PDF

Input PDF File: Download Link
Watermark PDF File: Download Link
Output Watermarked PDF File: Download Link

Conclusion

In this guide, we explored essential PDF operations in Python, including extracting text, rotating pages, merging, splitting, and adding watermarks. With these skills, you can now build your own PDF manager and automate various PDF tasks efficiently.

Extract Text from PDF File Using Python

Wed, 15 Jan 2025 00:00:00 +0000

Last Updated: 15 Jan, 2025

In this article, we will show you how to extract text from a PDF file using Python.

PDF, which stands for Portable Document Format, is a popular digital document format. It is designed to allow documents to be viewed and shared easily and reliably, regardless of software, hardware, or operating system. PDF files have the extension .pdf.

Two libraries are commonly used to extract text from a PDF file in Python: pypdf and PyMuPDF. We will show you how to extract text from a PDF using both of them.

How to Extract Text from a PDF File Using pypdf in Python

Here are the steps.

Install pypdf
Run the code given in this article
See the output

Install pypdf

You can install pypdf using the following command:

pip install pypdf

Sample Code to Extract Text from PDF using pypdf

sample.pdf - Download Link (This sample PDF will be used in the code, but you can certainly use your own PDF.)

Code

Here is a complete code example for extracting text from a PDF using pypdf.

Output

Here is the output of the sample code provided above.

How to Extract Text from a PDF File Using PyMuPDF in Python

Here are the steps.

Install PyMuPDF
Run the code given in this article
See the output

Install PyMuPDF

Install PyMuPDF, which is imported in code under the name fitz, using the following command.

pip install pymupdf

Sample Code to Extract Text from PDF using PyMuPDF

We use the same PDF as in the pypdf example.

sample.pdf - Download Link (This sample PDF will be used in the code, but you can certainly use your own PDF.)

Code

Here is a complete code example for extracting text from a PDF using PyMuPDF.

Output

Here is the output of the sample code provided above.

Conclusion

In this article, we provided sample Python code, a sample file, and the resulting output to demonstrate how to extract text from a PDF using two libraries: pypdf and PyMuPDF.

If you have any questions or encounter any issues while running the code, feel free to leave a comment in our forums!

Convert PDF to Image in Python

Sat, 04 Jan 2025 00:00:00 +0000

Last Updated: 27 Jan, 2025

How to Convert PDF to Image in Python: A Step-by-Step Guide

Converting PDF files into image formats like JPEG or PNG can be extremely useful, especially for scenarios where you need to extract images from a PDF, present a preview of the document, or work with visual data. Python, being a versatile programming language, offers multiple ways to perform this task efficiently.

In this guide, we’ll walk you through a step-by-step process of converting a PDF to an image in Python. You’ll learn how to do this using popular Python libraries, with code examples and helpful troubleshooting tips. We also provide the complete code, its output images, and the sample PDF used as input.

What You Need to Convert PDF to Image in Python

Before we jump into the code, let’s make sure you have the right tools to get started. For this task, you’ll need to install the following Python libraries:

Pillow: A maintained fork of the Python Imaging Library (PIL), widely used for opening, manipulating, and saving image files.
pdf2image: This library helps you convert PDF pages to images in Python. It uses Poppler for rendering PDF pages into images.

Installing the Required Libraries

You can install these libraries using pip:

pip install pillow pdf2image

If you don’t have Poppler installed on your system, you may need to install it separately. Check the installation guide for your platform here.

Step-by-Step Guide on Converting PDF to Image in Python

Step 1: Import the Necessary Libraries

Start by importing the necessary Python libraries:

from pdf2image import convert_from_path
from PIL import Image

Step 2: Convert PDF to Images

With the libraries imported, you can now convert a PDF file to images. Here’s how you do it:

# Convert each PDF page to a PIL image
images = convert_from_path('yourfile.pdf')

# Save each page as an image, numbering files from 1 (page_1.jpg, page_2.jpg, ...)
for i, image in enumerate(images, start=1):
    image.save(f'page_{i}.jpg', 'JPEG')

Explanation of the Code:

The convert_from_path() function converts the PDF file into a list of PIL image objects.
We then loop through the images and save each page of the PDF as a separate image (in this case, JPEG format).

Step 3: Optional – Convert to Other Image Formats

You can easily convert the images to other formats, like PNG, by changing the format in the image.save() method:

image.save(f'page_{i}.png', 'PNG')

Complete Code

Here is the complete code. Simply copy it, save it with any name and the .py extension, and then execute it. For example, you can name it convert_pdf_to_images.py.

Before executing, just update the pdf_path variable to point to the path of your input PDF file.

Download the Sample PDF and View Its Screenshot

You can use any PDF, but for the sake of running and testing this code, we used this specific PDF.

Download Sample PDF

Output Images Generated by the Code

page_1.jpg
page_2.jpg
page_3.jpg

Alternative Methods to Convert PDF to Image in Python

While pdf2image and Poppler are widely used, there are other methods to convert PDF to image without needing Poppler. For example:

Using PyMuPDF (fitz): This library can render PDF pages to images on its own, without Poppler.

pip install pymupdf

Example code:

import fitz  # PyMuPDF

# Open the PDF file
doc = fitz.open("yourfile.pdf")

# Loop through each page and convert to image
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    pix = page.get_pixmap()
    pix.save(f"page_{page_num}.png")

This method works without requiring Poppler and can be an alternative if you’re facing installation issues.

Common Errors and Troubleshooting

While converting PDFs to images in Python is generally straightforward, you might encounter some issues. Here are a few common errors and their solutions:

Error: OSError: cannot identify image file
- This typically happens if the PDF is not properly rendered. Ensure Poppler is installed correctly and is accessible from your Python environment.
Error: RuntimeError: cannot open image file
- This error can occur if you’re trying to open an image format that is unsupported. Double-check the format you’re saving the image in (JPEG, PNG, etc.) and ensure that Pillow supports it.
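Since several of these problems trace back to Poppler not being reachable, a quick check before converting can save time. The sketch below uses only the standard library to look for Poppler's pdfinfo and pdftoppm command-line tools on your PATH; the function name missing_poppler_tools is our own.

```python
import shutil


def missing_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Return the names of the given command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

If the function returns an empty list, both tools were found. Otherwise, install Poppler or add its bin folder to your PATH before running the conversion.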

Conclusion

Converting PDF documents to images in Python is easy with the help of libraries like pdf2image and Pillow. Whether you’re looking to extract images from a PDF or simply want to display each page as a picture, this guide has shown you how to do it step by step.

Remember, depending on your project needs, you can also explore other Python libraries like PyMuPDF to achieve similar results.

If you have any questions or run into any issues while implementing this solution, feel free to leave a comment in our forums!

If this guide helped you, don’t forget to share it with others, and explore our other helpful guides for more coding tips and tricks!
