<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Python on File Format Blog</title>
    <link>https://blog.fileformat.com/categories/python/</link>
    <description>Recent content in Python on File Format Blog</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Wed, 29 Jan 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.fileformat.com/categories/python/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Working with PDF files in Python</title>
      <link>https://blog.fileformat.com/programming/working-with-pdf-files-in-python/</link>
      <pubDate>Wed, 29 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://blog.fileformat.com/programming/working-with-pdf-files-in-python/</guid>
      <description>Learn how to extract text from a PDF in Python, rotate PDF pages, merge multiple PDFs, split PDFs, and add watermarks to your PDFs using Python libraries and simple code examples.</description>
      <content:encoded><![CDATA[<p><strong>Last Updated</strong>: 29 Jan, 2025</p>
<figure class="align-center ">
    <img loading="lazy" src="images/working-with-pdf-files-in-python.png#center"
         alt="Title - Working with PDF files in Python"/> 
</figure>

<p>In this article, we will guide you on <strong>how to work with PDF files using Python</strong>. For this, we’ll utilize the <a href="https://pypi.org/project/pypdf/"><strong>pypdf</strong></a> library.</p>
<p>Using the <strong>pypdf</strong> library, we&rsquo;ll demonstrate how to perform the following operations in Python:</p>
<ul>
<li>Extracting text from PDFs</li>
<li>Rotating PDF pages</li>
<li>Merging multiple PDFs</li>
<li>Splitting PDFs into separate files</li>
<li>Adding watermarks to PDF pages</li>
</ul>
<p><em><strong>Note</strong>: This article covers a lot of valuable details, so feel free to skip to the sections that interest you the most! The content is organized for easy navigation, so you can quickly focus on what&rsquo;s most relevant to you.</em></p>
<figure class="align-center ">
    <img loading="lazy" src="images/pdf-manipulation-with-pypdf.webp#center"
         alt="Illustration - Working with PDF files in Python"/> 
</figure>

<h2 id="sample-codes">Sample Code</h2>
<p>You can download all the sample code used in this article from the following link. It includes the code, input files, and output files.</p>
<ul>
<li><a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python">Code Examples and Input Files for Working with PDF Files in Python</a></li>
</ul>
<h2 id="install-pypdf">Install pypdf</h2>
<p>To install pypdf, simply run the following command in your terminal or command prompt:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install pypdf
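</span></span></code></pre></div><p><strong>Note:</strong> pip matches package names case-insensitively, so the capitalization of <code>pypdf</code> in this command does not matter. The import name in Python code, however, is case-sensitive and must be written as <code>pypdf</code>.</p>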
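<p>To confirm the installation from Python, a minimal sketch using only the standard library could look like the following. The helper name <code>installed_version</code> is our own and is not part of pypdf; the sketch assumes pypdf was installed into the same environment that runs the script.</p>

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the pypdf version, or a hint if pypdf is not installed in this environment.
found = installed_version("pypdf")
print(f"pypdf {found}" if found else "pypdf is not installed; run: pip install pypdf")
```

<p>If the script reports that pypdf is missing even though the install succeeded, <code>pip</code> and <code>python</code> are probably pointing at different environments.</p>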
<h2 id="1-extracting-text-from-a-pdf-file-using-python">1. Extracting Text from a PDF File Using Python</h2>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/e2b43a49dbad9e89745f8f9777817acb.js?file=extract-text-from-pdf-using-pypdf-in-python.py"></script>

<h3 id="code-explanation"><strong>Code Explanation</strong></h3>
<p><strong>1. Creating a PDF Reader Object</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>reader <span style="color:#f92672">=</span> PdfReader(pdf_file)
</span></span></code></pre></div><ul>
<li><code>PdfReader(pdf_file)</code> loads the PDF file into a <strong>reader object</strong>.</li>
<li>This object allows access to the pages and their content.</li>
</ul>
<p><strong>2. Looping Through Pages</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> page_number, page <span style="color:#f92672">in</span> enumerate(reader<span style="color:#f92672">.</span>pages, start<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>):
</span></span></code></pre></div><ul>
<li><code>reader.pages</code> gives a list-like sequence of the pages in the PDF.</li>
<li><code>enumerate(..., start=1)</code> assigns a <strong>page number starting from 1</strong>.</li>
</ul>
<p><strong>3. Printing Extracted Text</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>    print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Page </span><span style="color:#e6db74">{</span>page_number<span style="color:#e6db74">}</span><span style="color:#e6db74">:&#34;</span>)
</span></span><span style="display:flex;"><span>    print(page<span style="color:#f92672">.</span>extract_text())
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">&#34;-&#34;</span> <span style="color:#f92672">*</span> <span style="color:#ae81ff">50</span>)  <span style="color:#75715e"># Separator for readability</span>
</span></span></code></pre></div><ul>
<li><code>page.extract_text()</code> extracts text content from the current page.</li>
<li>The script prints the extracted text along with the <strong>page number</strong>.</li>
<li><code>&quot;-&quot; * 50</code> prints a separator line (<code>--------------------------------------------------</code>) for better readability.</li>
</ul>
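<p>The three steps above can be combined into one small sketch. The function names <code>extract_text_by_page</code> and <code>print_pdf_text</code> are hypothetical, chosen for this illustration; only <code>PdfReader</code>, <code>reader.pages</code>, and <code>page.extract_text()</code> come from pypdf. The import is placed inside the function so that the first helper can be used with any object that has a <code>pages</code> attribute.</p>

```python
def extract_text_by_page(reader):
    """Return a list of (page_number, text) tuples for a PdfReader-like object.

    Page numbers start at 1. A page without a usable text value yields "".
    """
    results = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # guard against a missing value
        results.append((page_number, text))
    return results


def print_pdf_text(pdf_file):
    """Open pdf_file with pypdf and print the text of every page."""
    from pypdf import PdfReader  # imported here so the helper above works without pypdf

    reader = PdfReader(pdf_file)
    for page_number, text in extract_text_by_page(reader):
        print(f"Page {page_number}:")
        print(text)
        print("-" * 50)  # Separator for readability
```

<p>Calling <code>print_pdf_text</code> with the path of your own PDF prints each page in turn. Returning the text as a list first makes it easy to save or search the text instead of printing it.</p>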
<h3 id="input-pdf-file-used-in-the-code">Input PDF File Used in the Code</h3>
<ul>
<li><strong>Input File:</strong> <a href="https://github.com/fileformat-blog-gists/code/blob/main/working-with-pdf-files-in-python/pdf-to-extract-text/">Download Link</a></li>
</ul>
<h3 id="output-of-the-code">Output of the Code</h3>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/ab6976aa3a0fc2999093f5f9320a9e20.js?file=Output%20-%20extract-text-from-pdf-using-pypdf-in-python.txt"></script>

<h2 id="2-rotating-pdf-pages-using-python">2. Rotating PDF Pages Using Python</h2>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/760d480cfede4178296c353d60662e1a.js?file=rotate-pdf-page-using-pypdf-in-python.py"></script>

<h3 id="code-explanation-1">Code Explanation</h3>
<p>The code rotates the <strong>first page</strong> by <strong>90° clockwise</strong> and saves the modified PDF, leaving the other pages unchanged.</p>
<p><strong>1. Import Required Classes</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> pypdf <span style="color:#f92672">import</span> PdfReader, PdfWriter
</span></span></code></pre></div><ul>
<li><code>PdfReader</code>: Reads the input PDF.</li>
<li><code>PdfWriter</code>: Creates a new PDF with modifications.</li>
</ul>
<p><strong>2. Define Input and Output File Paths</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>input_pdf <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;pdf-to-rotate/input.pdf&#34;</span>
</span></span><span style="display:flex;"><span>output_pdf <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;pdf-to-rotate/rotated_output.pdf&#34;</span>
</span></span></code></pre></div><ul>
<li>The script reads from <code>input.pdf</code> and saves the modified file as <code>rotated_output.pdf</code>.</li>
</ul>
<p><strong>3. Read the PDF and Create a Writer Object</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>reader <span style="color:#f92672">=</span> PdfReader(input_pdf)
</span></span><span style="display:flex;"><span>writer <span style="color:#f92672">=</span> PdfWriter()
</span></span></code></pre></div><ul>
<li><code>reader</code> loads the existing PDF.</li>
<li><code>writer</code> is used to store the modified pages.</li>
</ul>
<p><strong>4. Rotate the First Page by 90 Degrees</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>page <span style="color:#f92672">=</span> reader<span style="color:#f92672">.</span>pages[<span style="color:#ae81ff">0</span>]
</span></span><span style="display:flex;"><span>page<span style="color:#f92672">.</span>rotate(<span style="color:#ae81ff">90</span>)  <span style="color:#75715e"># Rotate 90 degrees clockwise</span>
</span></span><span style="display:flex;"><span>writer<span style="color:#f92672">.</span>add_page(page)
</span></span></code></pre></div><ul>
<li>Takes <strong>page 1</strong> (index 0), rotates it <strong>90 degrees</strong> clockwise, and adds it to the new PDF.</li>
</ul>
<p><strong>5. Add Remaining Pages Without Changes</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1</span>, len(reader<span style="color:#f92672">.</span>pages)):
</span></span><span style="display:flex;"><span>    writer<span style="color:#f92672">.</span>add_page(reader<span style="color:#f92672">.</span>pages[i])
</span></span></code></pre></div><ul>
<li>Loops through the remaining pages and adds them as they are.</li>
</ul>
<p><strong>6. Save the New PDF</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(output_pdf, <span style="color:#e6db74">&#34;wb&#34;</span>) <span style="color:#66d9ef">as</span> file:
</span></span><span style="display:flex;"><span>    writer<span style="color:#f92672">.</span>write(file)
</span></span></code></pre></div><ul>
<li>Opens <code>rotated_output.pdf</code> in write-binary mode and saves the new PDF.</li>
</ul>
<p><strong>7. Print Confirmation</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Rotated page saved to </span><span style="color:#e6db74">{</span>output_pdf<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><ul>
<li>Displays a success message.</li>
</ul>
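<p>The same steps can be generalized to rotate any set of pages. The sketch below is one possible variation, not the script from the gist: the names <code>normalize_rotation</code> and <code>rotate_pages</code> are hypothetical, and the angle check reflects the fact that PDF page rotation works in multiples of 90 degrees.</p>

```python
def normalize_rotation(angle):
    """Validate a rotation angle and reduce it to 0, 90, 180, or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError("rotation angle must be a multiple of 90 degrees")
    return angle % 360


def rotate_pages(input_pdf, output_pdf, angle, page_indexes):
    """Rotate the pages at the given zero-based indexes and write a new PDF."""
    from pypdf import PdfReader, PdfWriter  # imported here so the check above works without pypdf

    angle = normalize_rotation(angle)
    targets = set(page_indexes)
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        if index in targets and angle:
            page.rotate(angle)  # positive angles rotate clockwise
        writer.add_page(page)  # pages that are not targeted are copied unchanged
    with open(output_pdf, "wb") as file:
        writer.write(file)
```

<p>For example, <code>rotate_pages("pdf-to-rotate/input.pdf", "pdf-to-rotate/rotated_output.pdf", 90, [0])</code> would reproduce the behavior described above, rotating only the first page.</p>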
<h3 id="input-pdf-used-in-the-code-and-its-rotated-output">Input PDF Used in the Code and Its Rotated Output</h3>
<ul>
<li><strong>Input PDF File:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-rotate/">Download Link</a></li>
<li><strong>Output Rotated PDF File:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-rotate/rotated_output.pdf">Download Link</a></li>
</ul>
<p><strong>Screenshot</strong>
<img loading="lazy" src="https://raw.githubusercontent.com/fileformat-blog-gists/content/main/working-with-pdf-files-in-python/rotated-pdf.png" alt="Screenshot of Rotated Page in PDF Using Python"  />
</p>
<h2 id="3-merging-pdf-files-using-python">3. Merging PDF Files Using Python</h2>
<p>This Python script demonstrates how to <strong>merge multiple PDF files</strong> from a directory into a single PDF using the <strong>pypdf</strong> library.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/a1a571783e0f5e699678d1094bf1afa5.js?file=merge_pdf_files_using_pypdf_in_python.py"></script>

<h3 id="code-explanation-2">Code Explanation</h3>
<ul>
<li>This script automatically merges all PDF files found in the specified directory (<code>pdfs-to-merge</code>) into a single output file (<code>merged_output.pdf</code>).</li>
<li>It ensures the output directory exists and adds each PDF&rsquo;s pages in alphabetical order of file name.</li>
<li>It writes the final merged file to the <code>output-dir</code> subdirectory inside <code>pdfs-to-merge</code>.</li>
</ul>
<p><strong>Code Breakdown</strong></p>
<p><strong>1. Import Libraries</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pypdf <span style="color:#f92672">import</span> PdfReader, PdfWriter
</span></span></code></pre></div><ul>
<li><code>os</code>: Used to interact with the file system, such as reading directories and managing file paths.</li>
<li><code>PdfReader</code>: Reads the content of a PDF file.</li>
<li><code>PdfWriter</code>: Creates and writes a new PDF file.</li>
</ul>
<p><strong>2. Define Directory and Output File</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>directory <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;pdfs-to-merge&#34;</span>
</span></span><span style="display:flex;"><span>output_file <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;output-dir/merged_output.pdf&#34;</span>
</span></span></code></pre></div><ul>
<li><code>directory</code>: Specifies the folder where the PDF files are stored.</li>
<li><code>output_file</code>: Defines the path and name of the merged PDF, relative to <code>directory</code>.</li>
</ul>
<p><strong>3. Create Output Directory if It Doesn&rsquo;t Exist</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>os<span style="color:#f92672">.</span>makedirs(os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>join(directory, <span style="color:#e6db74">&#34;output-dir&#34;</span>), exist_ok<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span></code></pre></div><ul>
<li>This ensures the <strong>output directory</strong> exists, and if it doesn&rsquo;t, it creates it.</li>
</ul>
<p><strong>4. Create a PdfWriter Object</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>writer <span style="color:#f92672">=</span> PdfWriter()
</span></span></code></pre></div><ul>
<li><code>writer</code> is used to collect and combine all the pages from the PDFs.</li>
</ul>
<p><strong>5. Iterate Over All PDF Files in the Directory</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> file_name <span style="color:#f92672">in</span> sorted(os<span style="color:#f92672">.</span>listdir(directory)):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> file_name<span style="color:#f92672">.</span>endswith(<span style="color:#e6db74">&#34;.pdf&#34;</span>):
</span></span><span style="display:flex;"><span>        file_path <span style="color:#f92672">=</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>join(directory, file_name)
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Adding: </span><span style="color:#e6db74">{</span>file_name<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><ul>
<li>This loop goes through all files in the specified directory, checking for files with the <code>.pdf</code> extension. It uses <code>sorted()</code> to process them in alphabetical order.</li>
</ul>
<p><strong>6. Read Each PDF and Append Pages to the Writer</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>reader <span style="color:#f92672">=</span> PdfReader(file_path)
</span></span><span style="display:flex;"><span>writer<span style="color:#f92672">.</span>append(reader)
</span></span></code></pre></div><ul>
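<li>For each PDF, <code>PdfReader</code> reads the file, and <code>writer.append(reader)</code> appends all of its pages to <code>writer</code>. These two lines use <code>file_path</code>, so they belong inside the <code>if</code> block of the loop shown in step 5.</li>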
</ul>
<p><strong>7. Write the Merged PDF to an Output File</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>output_path <span style="color:#f92672">=</span> os<span style="color:#f92672">.</span>path<span style="color:#f92672">.</span>join(directory, output_file)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(output_path, <span style="color:#e6db74">&#34;wb&#34;</span>) <span style="color:#66d9ef">as</span> output_pdf:
</span></span><span style="display:flex;"><span>    writer<span style="color:#f92672">.</span>write(output_pdf)
</span></span></code></pre></div><ul>
<li>After collecting all pages, <code>writer.write()</code> writes the merged PDF to the specified output path.</li>
</ul>
<p><strong>8. Print Confirmation</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Merged PDF saved as: </span><span style="color:#e6db74">{</span>output_path<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><ul>
<li>Prints a success message confirming the location of the saved merged PDF.</li>
</ul>
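<p>The breakdown above can be condensed into two functions. This is a sketch under a few assumptions rather than the gist itself: the names <code>list_pdf_files</code> and <code>merge_directory</code> are hypothetical, and the file filter is made case-insensitive (so <code>Report.PDF</code> is also picked up), which differs slightly from the <code>endswith(".pdf")</code> check shown earlier.</p>

```python
import os


def list_pdf_files(directory):
    """Return the names of the PDF files in directory, sorted by name.

    The extension check is case-insensitive, and subdirectories
    (such as an output folder) are skipped.
    """
    return sorted(
        name
        for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(directory, name))
    )


def merge_directory(directory, output_path):
    """Merge every PDF in directory into a single file at output_path."""
    from pypdf import PdfReader, PdfWriter  # imported here so the listing helper works without pypdf

    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    writer = PdfWriter()
    for name in list_pdf_files(directory):
        print(f"Adding: {name}")
        writer.append(PdfReader(os.path.join(directory, name)))
    with open(output_path, "wb") as output_pdf:
        writer.write(output_pdf)
    print(f"Merged PDF saved as: {output_path}")
```

<p>A call such as <code>merge_directory("pdfs-to-merge", "pdfs-to-merge/output-dir/merged_output.pdf")</code> mirrors the paths used in this section. Writing the result into a subdirectory keeps the merged file out of the input list when the script is run again.</p>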
<h3 id="input-pdf-files-used-in-the-code-and-the-merged-output-pdf">Input PDF Files Used in the Code and the Merged Output PDF</h3>
<ul>
<li><strong>Input PDF Files:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdfs-to-merge">Download Link</a></li>
<li><strong>Merged Output PDF:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdfs-to-merge/output-dir">Download Link</a></li>
</ul>
<h2 id="4-splitting-a-pdf-using-python">4. Splitting a PDF Using Python</h2>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/0dee64422ac0dcf44cf027d90567bbf8.js?file=split-pdf-using-pypdf-in-python.py"></script>

<h3 id="code-explanation-3">Code Explanation</h3>
<p>The above Python script splits a PDF into separate pages using the <strong>pypdf</strong> library. It first ensures that the output directory exists, then reads the input PDF file. The script loops through each page, creates a new <strong><code>PdfWriter</code></strong> object for it, and saves the page as an individual PDF file. The output files are named sequentially (e.g., <strong>page_1.pdf, page_2.pdf</strong>) and stored in the <strong><code>output-dir</code></strong> folder. Finally, it prints a confirmation message for each created file and reports when the process is complete.</p>
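<p>For readers who cannot view the embedded gist, the splitting logic described above can be sketched as follows. The function names <code>split_output_path</code> and <code>split_pdf</code>, and the exact wording of the printed messages, are assumptions made for this illustration; the file naming follows the <code>page_1.pdf</code>, <code>page_2.pdf</code> pattern mentioned in the explanation.</p>

```python
import os


def split_output_path(output_dir, page_number):
    """Build the output path for a 1-based page number, e.g. output-dir/page_1.pdf."""
    return os.path.join(output_dir, f"page_{page_number}.pdf")


def split_pdf(input_pdf, output_dir):
    """Write every page of input_pdf to its own single-page PDF in output_dir."""
    from pypdf import PdfReader, PdfWriter  # imported here so the path helper works without pypdf

    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_pdf)
    created = []
    for page_number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()  # a fresh writer per page keeps each file to one page
        writer.add_page(page)
        output_path = split_output_path(output_dir, page_number)
        with open(output_path, "wb") as output_file:
            writer.write(output_file)
        print(f"Created: {output_path}")
        created.append(output_path)
    print("PDF splitting complete.")
    return created
```

<p>The function returns the list of created paths, which is convenient if a later step needs to process the individual pages.</p>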
<h3 id="input-pdf-and-split-output-files">Input PDF and Split Output Files</h3>
<ul>
<li><strong>Input PDF File:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-split">Download Link</a></li>
<li><strong>Split Output Files:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-split/output-dir">Download Link</a></li>
</ul>
<h2 id="5-adding-a-watermark-to-a-pdf-using-python">5. Adding a Watermark to a PDF Using Python</h2>
<p>You can add a watermark to a PDF using the pypdf library by overlaying a watermark PDF onto an existing PDF. Make sure the watermark PDF has only one page so it applies correctly to each page of the main PDF.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/af057943580e2fcde6a635df34d7e39a.js?file=watermark-pdf-using-pypdf-in-python.py"></script>

<h3 id="code-explanation-4">Code Explanation</h3>
<p>The above Python script reads an input PDF and a one-page watermark PDF, overlays the watermark on each page of the input PDF, and saves the final watermarked PDF.</p>
<p><strong>Code Breakdown</strong></p>
<p>Here&rsquo;s a brief explanation of each part:</p>
<p><strong>1. Import Required Classes</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> pypdf <span style="color:#f92672">import</span> PdfReader, PdfWriter
</span></span></code></pre></div><ul>
<li><strong><code>PdfReader</code></strong> is used to read existing PDFs.</li>
<li><strong><code>PdfWriter</code></strong> is used to create and write a new PDF.</li>
</ul>
<p><strong>2. Define File Paths</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>input_pdf <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;pdf-to-watermark/input.pdf&#34;</span>
</span></span><span style="display:flex;"><span>watermark_pdf <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;pdf-to-watermark/watermark.pdf&#34;</span>
</span></span><span style="display:flex;"><span>output_pdf <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;pdf-to-watermark/output_with_watermark.pdf&#34;</span>
</span></span></code></pre></div><ul>
<li><code>input_pdf</code>: The original PDF to which the watermark will be added.</li>
<li><code>watermark_pdf</code>: A separate <strong>one-page</strong> PDF that serves as the watermark.</li>
<li><code>output_pdf</code>: The output file that will contain the watermarked pages.</li>
</ul>
<p><strong>3. Read PDFs</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>reader <span style="color:#f92672">=</span> PdfReader(input_pdf)
</span></span><span style="display:flex;"><span>watermark <span style="color:#f92672">=</span> PdfReader(watermark_pdf)
</span></span></code></pre></div><ul>
<li><code>reader</code>: Reads the input PDF.</li>
<li><code>watermark</code>: Reads the watermark PDF.</li>
</ul>
<p><strong>4. Create a Writer Object</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>writer <span style="color:#f92672">=</span> PdfWriter()
</span></span></code></pre></div><ul>
<li>This will be used to create the final watermarked PDF.</li>
</ul>
<p><strong>5. Extract Watermark Page</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>watermark_page <span style="color:#f92672">=</span> watermark<span style="color:#f92672">.</span>pages[<span style="color:#ae81ff">0</span>]
</span></span></code></pre></div><ul>
<li>Takes the first page of the watermark PDF. The script assumes the watermark PDF has only <strong>one page</strong>; this page is overlaid on every page of the input PDF.</li>
</ul>
<p><strong>6. Loop Through Input PDF Pages &amp; Merge Watermark</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> page <span style="color:#f92672">in</span> reader<span style="color:#f92672">.</span>pages:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Merge the watermark with the current page</span>
</span></span><span style="display:flex;"><span>    page<span style="color:#f92672">.</span>merge_page(watermark_page)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Add the merged page to the writer</span>
</span></span><span style="display:flex;"><span>    writer<span style="color:#f92672">.</span>add_page(page)
</span></span></code></pre></div><ul>
<li>Iterates through each page of <code>input_pdf</code>.</li>
<li><strong><code>merge_page(watermark_page)</code></strong> overlays the watermark on top of the current page.</li>
<li>Adds the modified page to the <code>writer</code>.</li>
</ul>
<p><strong>7. Save the Watermarked PDF</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(output_pdf, <span style="color:#e6db74">&#34;wb&#34;</span>) <span style="color:#66d9ef">as</span> output_file:
</span></span><span style="display:flex;"><span>    writer<span style="color:#f92672">.</span>write(output_file)
</span></span></code></pre></div><ul>
<li>Writes the modified pages into a new PDF file.</li>
</ul>
<p><strong>8. Print Confirmation</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Watermarked PDF saved as: </span><span style="color:#e6db74">{</span>output_pdf<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><ul>
<li>Prints the output file path for confirmation.</li>
</ul>
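<p>The walkthrough assumes a one-page watermark file without checking it. The sketch below adds that check; it is an illustrative variation, and the names <code>apply_watermark</code> and <code>watermark_pdf</code> are hypothetical. Only <code>PdfReader</code>, <code>PdfWriter</code>, <code>merge_page</code>, <code>add_page</code>, and <code>write</code> are pypdf calls.</p>

```python
def apply_watermark(pages, watermark_page):
    """Overlay watermark_page on each page in place and return the pages as a list."""
    stamped = []
    for page in pages:
        page.merge_page(watermark_page)  # modifies the page object itself
        stamped.append(page)
    return stamped


def watermark_pdf(input_pdf, watermark_path, output_pdf):
    """Overlay a one-page watermark PDF on every page of input_pdf."""
    from pypdf import PdfReader, PdfWriter  # imported here so apply_watermark works without pypdf

    watermark_reader = PdfReader(watermark_path)
    if len(watermark_reader.pages) != 1:
        raise ValueError("the watermark PDF is expected to have exactly one page")

    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for page in apply_watermark(reader.pages, watermark_reader.pages[0]):
        writer.add_page(page)
    with open(output_pdf, "wb") as output_file:
        writer.write(output_file)
    print(f"Watermarked PDF saved as: {output_pdf}")
```

<p>With the paths from step 2, the call would be <code>watermark_pdf("pdf-to-watermark/input.pdf", "pdf-to-watermark/watermark.pdf", "pdf-to-watermark/output_with_watermark.pdf")</code>.</p>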
<h3 id="input-pdf-watermark-pdf-and-output-watermarked-pdf">Input PDF, Watermark PDF, and Output Watermarked PDF</h3>
<ul>
<li><strong>Input PDF File:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-watermark">Download Link</a></li>
<li><strong>Watermark PDF File:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-watermark">Download Link</a></li>
<li><strong>Output Watermarked PDF File:</strong> <a href="https://github.com/fileformat-blog-gists/code/tree/main/working-with-pdf-files-in-python/pdf-to-watermark">Download Link</a></li>
</ul>
<p><strong>Screenshot</strong>
<img loading="lazy" src="https://raw.githubusercontent.com/fileformat-blog-gists/content/main/working-with-pdf-files-in-python/watermark-pdf.png" alt="Screenshot of Watermarked PDF Using Python"  />
</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this guide, we explored essential PDF operations in Python, including extracting text, rotating pages, merging, splitting, and adding watermarks. With these skills, you can now build your own PDF manager and automate various PDF tasks efficiently.</p>
]]></content:encoded>
    </item>
    
    <item>
      <title>Extract Text from PDF File Using Python</title>
      <link>https://blog.fileformat.com/en/programming/extract-text-from-pdf-file-using-python/</link>
      <pubDate>Wed, 15 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://blog.fileformat.com/en/programming/extract-text-from-pdf-file-using-python/</guid>
      <description>This article will show you how to extract text from a PDF in Python using popular libraries like pypdf and PyMuPDF. It will also provide sample code, sample files, and the output.</description>
      <content:encoded><![CDATA[<p><strong>Last Updated</strong>: 15 Jan, 2025</p>
<figure class="align-center ">
    <img loading="lazy" src="images/extract-text-from-pdf-file-using-python.webp#center"
         alt="Title - Extract Text from PDF File Using Python"/> 
</figure>

<h2 id="extract-text-from-pdf-file-using-python">Extract Text from PDF File Using Python</h2>
<p>In this article, we will show you <strong>how to extract text from a PDF file using Python</strong>.</p>
<p>PDF, which stands for <strong>Portable Document Format</strong>, is a popular digital document format. It is designed to allow documents to be viewed and shared easily and reliably, regardless of software, hardware, or operating system. PDF files have the extension <strong>.pdf</strong>.</p>
<p>Two libraries are commonly used to extract text from a PDF file in Python. We will show you how to extract text from a PDF using both of them.</p>
<ol>
<li><a href="https://pypi.org/project/pypdf/"><strong>pypdf</strong></a></li>
<li><a href="https://pypi.org/project/PyMuPDF/"><strong>PyMuPDF</strong></a></li>
</ol>
<h2 id="how-to-extract-text-from-a-pdf-file-using-pypdf-in-python">How to Extract Text from a PDF File Using pypdf in Python</h2>
<p>Here are the steps.</p>
<ol>
<li>Install <strong>pypdf</strong></li>
<li>Run the code given in this article</li>
<li>See the output</li>
</ol>
<h3 id="install-pypdf">Install pypdf</h3>
<p>You can install <strong>pypdf</strong> using the following command:</p>
<pre tabindex="0"><code>pip install pypdf
</code></pre><h3 id="sample-code-to-extract-text-from-pdf-using-pypdf">Sample Code to Extract Text from PDF using pypdf</h3>
<p><strong>sample.pdf</strong> - <a href="https://github.com/shakeel-faiz/InputOutputDocs/raw/master/python-convert-pdf-to-image/sample.pdf">Download Link</a> (This sample PDF will be used in the code, but you can certainly use your own PDF.)</p>
<p><strong>screenshot of sample.pdf</strong></p>
<p><img loading="lazy" src="https://raw.githubusercontent.com/shakeel-faiz/InputOutputDocs/master/python-convert-pdf-to-image/sample-input-pdf-screenshot.png" alt="Sample Input PDF Screenshot"  />
</p>
<h3 id="code">Code</h3>
<p>Here is a complete code example for <strong>extracting text from a PDF using pypdf</strong>.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/50b8279dca1fa397849031e8d370cd95.js?file=extract-text-from-pdf-using-pypdf.py"></script>

<h3 id="output">Output</h3>
<p>Here is the output of the sample code provided above.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/6870826ad3c40b67dfc3d4aef838328b.js?file=output-extract-text-from-pdf-using-pypdf"></script>
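<p>For reference alongside the embedded script, the same idea can be sketched inline. This is a minimal sketch, not the exact code from the gist: the function names <code>join_page_texts</code> and <code>extract_text</code> are our own, while <code>PdfReader</code>, <code>reader.pages</code>, and <code>page.extract_text()</code> come from pypdf.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def join_page_texts(pages):
    """Join the text of each page, with a header line before every page."""
    parts = []
    for number, page in enumerate(pages, start=1):
        # extract_text() can return an empty string, e.g. for scanned pages
        text = page.extract_text() or ""
        parts.append(f"--- Page {number} ---\n{text}")
    return "\n".join(parts)


def extract_text(pdf_path):
    """Read a PDF with pypdf and return the text of all its pages."""
    # Imported here so join_page_texts() can be reused without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_page_texts(reader.pages)
</code></pre>
<p>With <strong>sample.pdf</strong> in the working directory, calling <code>print(extract_text("sample.pdf"))</code> prints the text of each page under its own header line.</p>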

<h2 id="how-to-extract-text-from-a-pdf-file-using-pymupdf-in-python">How to Extract Text from a PDF File Using PyMuPDF in Python</h2>
<p>Here are the steps.</p>
<ol>
<li>Install <strong>PyMuPDF</strong></li>
<li>Run the code given in this article</li>
<li>See the output</li>
</ol>
<h3 id="install-pymupdf">Install PyMuPDF</h3>
<p>Install <strong>PyMuPDF</strong> (its Python module is traditionally imported as <strong>fitz</strong>) using the following command.</p>
<pre tabindex="0"><code>pip install pymupdf
</code></pre><h3 id="sample-code-to-extract-text-from-pdf-using-pymupdf">Sample Code to Extract Text from PDF using PyMuPDF</h3>
<p>We use the same PDF as before.</p>
<p><strong>sample.pdf</strong> - <a href="https://github.com/shakeel-faiz/InputOutputDocs/raw/master/python-convert-pdf-to-image/sample.pdf">Download Link</a> (This sample PDF will be used in the code, but you can certainly use your own PDF.)</p>
<h3 id="code-1">Code</h3>
<p>Here is a complete code example for <strong>extracting text from a PDF using PyMuPDF</strong>.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/799f8ecafe4d64feb803548b0d1db36d.js?file=extract-text-from-pdf-using-pymupdf.py"></script>

<h3 id="output-1">Output</h3>
<p>Here is the output of the sample code provided above.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/cfda58da76b68dea4c5269b627901417.js?file=output-extract-text-from-pdf-using-pymupdf"></script>
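<p>As an inline companion to the embedded script, here is a minimal sketch of the PyMuPDF approach. The helper names <code>collect_page_texts</code> and <code>extract_text</code> are assumptions made for illustration; <code>fitz.open()</code> and <code>page.get_text()</code> are PyMuPDF calls.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def collect_page_texts(doc):
    """Return (page_number, text) pairs for a document, numbering from 1."""
    return [(number, page.get_text()) for number, page in enumerate(doc, start=1)]


def extract_text(pdf_path):
    """Open a PDF with PyMuPDF and return the text of every page."""
    # PyMuPDF is imported as fitz; imported here so the helper above
    # can be tried without the library installed
    import fitz

    # The document is closed automatically when the with-block ends
    with fitz.open(pdf_path) as doc:
        return collect_page_texts(doc)
</code></pre>
<p>Each entry in the returned list pairs a page number with that page&rsquo;s text, so you can print the pages one by one or write them to separate files.</p>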

<h2 id="conclusion">Conclusion</h2>
<p>In this article, we provided sample Python code, a sample file, and the resulting output to demonstrate how to extract text from a PDF using two libraries: pypdf and PyMuPDF.</p>
<p>If you have any questions or encounter any issues while running the code, feel free to leave a comment in <a href="https://forum.fileformat.com/">our forums</a>!</p>
<h2 id="see-also">See Also</h2>
<ul>
<li><a href="https://blog.fileformat.com/programming/convert-pdf-to-image-in-python/">Python PDF to Image Conversion: Step-by-Step Guide</a></li>
<li><a href="https://blog.fileformat.com/programming/batch-change-file-encoding-to-utf8/">Batch change file encoding to UTF-8</a></li>
</ul>
]]></content:encoded>
    </item>
    
    <item>
      <title>Convert PDF to Image in Python</title>
      <link>https://blog.fileformat.com/programming/convert-pdf-to-image-in-python/</link>
      <pubDate>Sat, 04 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://blog.fileformat.com/programming/convert-pdf-to-image-in-python/</guid>
      <description>Learn how to convert a PDF file to image (JPEG, PNG) in Python with detailed examples. Step-by-step guide using popular libraries like pdf2image and PyMuPDF.</description>
      <content:encoded><![CDATA[<p><strong>Last Updated</strong>: 27 Jan, 2025</p>
<figure class="align-center ">
    <img loading="lazy" src="images/convert-pdf-to-image-in-python.webp#center"
         alt="Title - Python PDF to Image Conversion: Step-by-Step Guide"/> 
</figure>

<h2 id="how-to-convert-pdf-to-image-in-python-a-step-by-step-guide">How to Convert PDF to Image in Python: A Step-by-Step Guide</h2>
<p>Converting PDF files into image formats like <a href="https://docs.fileformat.com/image/jpeg/">JPEG</a> or <a href="https://docs.fileformat.com/image/png/">PNG</a> can be extremely useful, especially for scenarios where you need to extract images from a PDF, present a preview of the document, or work with visual data. <a href="https://www.python.org/">Python</a>, being a versatile programming language, offers multiple ways to perform this task efficiently.</p>
<p>In this guide, we&rsquo;ll walk you through a <strong>step-by-step process</strong> of converting a PDF to an image in Python. You’ll learn how to do this using popular Python libraries, with code examples and helpful troubleshooting tips. We also provide the complete code, the sample PDF it uses, and the output images it produces.</p>
<h2 id="what-you-need-to-convert-pdf-to-image-in-python">What You Need to Convert PDF to Image in Python</h2>
<p>Before we jump into the code, let&rsquo;s make sure you have the right tools to get started. For this task, you&rsquo;ll need to install the following Python libraries:</p>
<ol>
<li><a href="https://pillow.readthedocs.io/en/latest/handbook/tutorial.html"><strong>Pillow</strong></a>: The actively maintained fork of the Python Imaging Library (PIL), widely used for opening, manipulating, and saving image files.</li>
<li><a href="https://github.com/Belval/pdf2image"><strong>pdf2image</strong></a>: This library helps you convert PDF pages to images in Python. It uses <a href="https://poppler.freedesktop.org/"><strong>Poppler</strong></a> for rendering PDF pages into images.</li>
</ol>
<h3 id="installing-the-required-libraries">Installing the Required Libraries</h3>
<p>You can install these libraries using pip:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install pillow pdf2image
</span></span></code></pre></div><p>If you don’t have <strong>Poppler</strong> installed on your system, you may need to install it separately. Check the installation guide for your platform <a href="https://github.com/Belval/pdf2image">here</a>.</p>
<h2 id="step-by-step-guide-on-converting-pdf-to-image-in-python">Step-by-Step Guide on Converting PDF to Image in Python</h2>
<h3 id="step-1-import-the-necessary-libraries">Step 1: Import the Necessary Libraries</h3>
<p>Start by importing the necessary Python libraries:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> pdf2image <span style="color:#f92672">import</span> convert_from_path
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> PIL <span style="color:#f92672">import</span> Image
</span></span></code></pre></div><h3 id="step-2-convert-pdf-to-images">Step 2: Convert PDF to Images</h3>
<p>With the libraries imported, you can now convert a PDF file to images. Here&rsquo;s how you do it:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Convert PDF to images</span>
</span></span><span style="display:flex;"><span>images <span style="color:#f92672">=</span> convert_from_path(<span style="color:#e6db74">&#39;yourfile.pdf&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Save each page as an image</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> i, image <span style="color:#f92672">in</span> enumerate(images, start<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>):
</span></span><span style="display:flex;"><span>    image<span style="color:#f92672">.</span>save(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;page_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">}</span><span style="color:#e6db74">.jpg&#39;</span>, <span style="color:#e6db74">&#39;JPEG&#39;</span>)
</span></span></code></pre></div><h3 id="explanation-of-the-code">Explanation of the Code:</h3>
<ul>
<li>The <code>convert_from_path()</code> function converts the PDF file into a list of <strong>PIL image objects</strong>.</li>
<li>We then loop through the images and save each page of the PDF as a separate image (in this case, JPEG format).</li>
</ul>
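<p>The conversion loop described above can be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: the helper names <code>page_filename</code> and <code>convert_pdf</code> and the <code>output-images</code> folder are our own choices for illustration, while <code>dpi</code> is a real <code>convert_from_path()</code> parameter (200 is its default).</p>
<pre tabindex="0"><code class="language-python" data-lang="python">from pathlib import Path


def page_filename(page_number, total_pages, extension="jpg"):
    """Build a zero-padded name such as page_02.jpg so files sort in page order."""
    width = len(str(total_pages))
    return f"page_{page_number:0{width}d}.{extension}"


def convert_pdf(pdf_path, output_dir="output-images", dpi=200):
    """Save every page of a PDF as a JPEG and return the saved paths."""
    # Imported here so page_filename() can be used without pdf2image installed
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    saved = []
    # Number pages from 1 so the file names match the page numbers
    for number, image in enumerate(images, start=1):
        target = out / page_filename(number, len(images))
        image.save(target, "JPEG")
        saved.append(target)
    return saved
</code></pre>
<p>Zero-padding only kicks in once a document has ten or more pages, so a three-page PDF still produces <code>page_1.jpg</code>, <code>page_2.jpg</code>, and <code>page_3.jpg</code>.</p>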
<h3 id="step-3-optional--convert-to-other-image-formats"><strong>Step 3: Optional – Convert to Other Image Formats</strong></h3>
<p>You can easily convert the images to other formats, like PNG, by changing the format in the <code>image.save()</code> method:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>image<span style="color:#f92672">.</span>save(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;page_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">}</span><span style="color:#e6db74">.png&#39;</span>, <span style="color:#e6db74">&#39;PNG&#39;</span>)
</span></span></code></pre></div><h3 id="complete-code">Complete Code</h3>
<p>Here is the complete code. Simply copy it, save it with any name and the <code>.py</code> extension, and then execute it. For example, you can name it <code>convert_pdf_to_images.py</code>.</p>
<p>Before executing, just update the <code>pdf_path</code> variable to point to the path of your input PDF file.</p>
<script type="application/javascript" src="https://gist.github.com/fileformat-blog-gists/6e26bc3d0c73587f6be860e20a5d6881.js?file=convert-pdf-to-image-in-python.py"></script>

<h3 id="download-the-sample-pdf-and-view-its-screenshot">Download the Sample PDF and View Its Screenshot</h3>
<p>You can use any PDF, but for the sake of running and testing this code, we used this specific PDF.</p>
<ul>
<li><a href="https://github.com/fileformat-blog-gists/content/raw/main/convert-pdf-to-image-in-python/sample.pdf">Download Sample PDF</a></li>
</ul>
<p><img loading="lazy" src="https://raw.githubusercontent.com/fileformat-blog-gists/content/main/convert-pdf-to-image-in-python/sample-input-pdf-screenshot.png" alt="Sample Input PDF Screenshot"  />
</p>
<h3 id="output-images-generated-by-the-code">Output Images Generated by the Code</h3>
<ul>
<li>page_1.jpg</li>
<li>page_2.jpg</li>
<li>page_3.jpg</li>
</ul>
<p><img loading="lazy" src="https://raw.githubusercontent.com/fileformat-blog-gists/content/main/convert-pdf-to-image-in-python/output-images/page_1.jpg" alt="page_1.jpg"  />

<img loading="lazy" src="https://raw.githubusercontent.com/fileformat-blog-gists/content/main/convert-pdf-to-image-in-python/output-images/page_2.jpg" alt="page_2.jpg"  />

<img loading="lazy" src="https://raw.githubusercontent.com/fileformat-blog-gists/content/main/convert-pdf-to-image-in-python/output-images/page_3.jpg" alt="page_3.jpg"  />
</p>
<h2 id="alternative-methods-to-convert-pdf-to-image-in-python">Alternative Methods to Convert PDF to Image in Python</h2>
<p>While <strong>pdf2image</strong> and <strong>Poppler</strong> are widely used, there are other methods to convert PDF to image without needing <strong>Poppler</strong>. For example:</p>
<ol>
<li><strong>Using PyMuPDF (fitz)</strong>: This library renders PDF pages to images directly and can also extract the images embedded in a PDF.</li>
</ol>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install pymupdf
</span></span></code></pre></div><p>Example code:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> fitz  <span style="color:#75715e"># PyMuPDF</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Open the PDF file</span>
</span></span><span style="display:flex;"><span>doc <span style="color:#f92672">=</span> fitz<span style="color:#f92672">.</span>open(<span style="color:#e6db74">&#34;yourfile.pdf&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Loop through each page and convert to image</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> page_num <span style="color:#f92672">in</span> range(len(doc)):
</span></span><span style="display:flex;"><span>    page <span style="color:#f92672">=</span> doc<span style="color:#f92672">.</span>load_page(page_num)
</span></span><span style="display:flex;"><span>    pix <span style="color:#f92672">=</span> page<span style="color:#f92672">.</span>get_pixmap()
</span></span><span style="display:flex;"><span>    pix<span style="color:#f92672">.</span>save(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;page_</span><span style="color:#e6db74">{</span>page_num<span style="color:#e6db74">}</span><span style="color:#e6db74">.png&#34;</span>)
</span></span></code></pre></div><p>This method works without requiring <strong>Poppler</strong> and can be an alternative if you&rsquo;re facing installation issues.</p>
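<p>By default, <code>get_pixmap()</code> renders at 72 dpi, which can look blurry for text-heavy pages. One way to get sharper output is to pass a scaling matrix. The sketch below assumes helper names of our own (<code>zoom_for_dpi</code>, <code>render_pages</code>) and a target of 144 dpi chosen only for illustration.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def zoom_for_dpi(dpi):
    """Return the scale factor for a target dpi (PDF pages use 72 units per inch)."""
    return dpi / 72


def render_pages(pdf_path, dpi=144):
    """Render every page of a PDF to a PNG file and return the file names."""
    # PyMuPDF is imported as fitz; imported here so zoom_for_dpi() works alone
    import fitz

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale both axes by the same factor

    saved = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            pix = page.get_pixmap(matrix=matrix)
            name = f"page_{number}.png"
            pix.save(name)
            saved.append(name)
    return saved
</code></pre>
<p>Doubling the zoom doubles both the width and the height of each image, so file sizes grow quickly at high dpi values.</p>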
<h2 id="common-errors-and-troubleshooting"><strong>Common Errors and Troubleshooting</strong></h2>
<p>While converting PDFs to images in Python is generally straightforward, you might encounter some issues. Here are a few common errors and their solutions:</p>
<ol>
<li>
<p><strong>Error: <code>OSError: cannot identify image file</code></strong></p>
<ul>
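<li>This typically happens if the <strong>PDF is not properly rendered</strong>. Ensure <strong>Poppler</strong> is installed correctly and is accessible from your Python environment. If Poppler cannot be found at all, <strong>pdf2image</strong> raises <code>PDFInfoNotInstalledError</code> instead.</li>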
</ul>
</li>
<li>
<p><strong>Error: <code>RuntimeError: cannot open image file</code></strong></p>
<ul>
<li>This error can occur if you&rsquo;re trying to open an image format that is unsupported. Double-check the format you&rsquo;re saving the image in (JPEG, PNG, etc.) and ensure that <strong>Pillow</strong> supports it.</li>
</ul>
</li>
</ol>
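<p>To turn failures like these into readable hints, you can catch the exceptions that <strong>pdf2image</strong> defines in <code>pdf2image.exceptions</code>. This is a minimal sketch: the function name <code>convert_or_explain</code> and the messages are our own, and you may prefer to log or re-raise rather than print.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def convert_or_explain(pdf_path):
    """Convert a PDF to PIL images, printing a hint instead of a traceback."""
    # Imported inside the function so a missing package surfaces as a plain
    # ImportError at call time
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path)
    except PDFInfoNotInstalledError:
        # Raised when the Poppler command-line tools cannot be found
        print("Poppler not found: install it and make sure it is on PATH.")
    except (PDFPageCountError, PDFSyntaxError) as error:
        # Raised when the page count cannot be read (for example a missing or
        # damaged file) or, in strict mode, when the PDF has syntax errors
        print(f"Could not read {pdf_path}: {error}")
    return []
</code></pre>
<p>The function returns an empty list on failure, so calling code can check the result before trying to save any pages.</p>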
<h2 id="conclusion">Conclusion</h2>
<p>Converting PDF documents to images in Python is easy with the help of libraries like <strong>pdf2image</strong> and <strong>Pillow</strong>. Whether you&rsquo;re looking to extract images from a PDF or simply want to display each page as a picture, this guide has shown you how to do it step by step.</p>
<p>Remember, depending on your project needs, you can also explore other Python libraries like <strong>PyMuPDF</strong> to achieve similar results.</p>
<p>If you have any questions or run into any issues while implementing this solution, feel free to leave a comment in <a href="https://forum.fileformat.com/">our forums</a>!</p>
<h2 id="share-and-explore">Share and Explore</h2>
<p>If this guide helped you, don&rsquo;t forget to share it with others, and explore our other helpful guides for more coding tips and tricks!</p>
<h2 id="see-also">See Also</h2>
<ul>
<li><a href="https://blog.fileformat.com/programming/batch-change-file-encoding-to-utf8/">Batch change file encoding to UTF-8</a></li>
</ul>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
