{"id":9242,"date":"2020-10-13T17:05:30","date_gmt":"2020-10-13T17:05:30","guid":{"rendered":"https:\/\/www.askpython.com\/?p=9242"},"modified":"2020-10-13T17:05:32","modified_gmt":"2020-10-13T17:05:32","slug":"process-text-from-pdf-files","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/process-text-from-pdf-files","title":{"rendered":"How to Process Text from PDF Files in Python?"},"content":{"rendered":"\n<p>PDFs are a common way to share text. <em>PDF<\/em> stands for <em><strong>Portable Document Format<\/strong><\/em> and uses the <strong><em>.pdf<\/em> file extension<\/strong>. It was created in the early 1990s by Adobe Systems.<\/p>\n\n\n\n<p>Reading PDF documents using Python can help you automate a wide variety of tasks. <\/p>\n\n\n\n<p>In this tutorial, we will learn how to <strong>extract text from a PDF file in Python<\/strong>. <\/p>\n\n\n\n<p>Let&#8217;s get started. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Reading and Extracting Text from a PDF File in Python<\/h2>\n\n\n\n<p>For this tutorial, we will create a sample PDF with two pages. You can do so using any word processor, such as Microsoft Word or Google Docs, and save the file as a PDF.<\/p>\n\n\n\n<p>Text on page 1:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nHello World. \nThis is a sample PDF with 2 pages. \nThis is the first page. \n<\/pre><\/div>\n\n\n<p>Text on page 2:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nThis is the text on Page 2. \n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Using PyPDF2 to Extract PDF Text<\/h2>\n\n\n\n<p>You can use <a class=\"rank-math-link\" href=\"https:\/\/pypi.org\/project\/PyPDF2\/\" target=\"_blank\" rel=\"noopener\">PyPDF2 <\/a>to extract text from a PDF. Let&#8217;s see how it works. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
Install the package <\/h3>\n\n\n\n<p>To install PyPDF2 on your system, enter the following command in your terminal. You can read more about the <a href=\"https:\/\/www.askpython.com\/python-modules\/python-pip\" class=\"rank-math-link\">pip package manager<\/a>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npip install pypdf2\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"670\" height=\"134\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/pypdf.png\" alt=\"Pypdf\" class=\"wp-image-9243\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/pypdf.png 670w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/pypdf-300x60.png 300w\" sizes=\"auto, (max-width: 670px) 100vw, 670px\" \/><figcaption>Pypdf<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2. Import PyPDF2<\/h3>\n\n\n\n<p>Open a new Python notebook and start by importing PyPDF2. <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport PyPDF2\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">3. Open the PDF in read-binary mode<\/h3>\n\n\n\n<p>Start by opening the PDF in <a href=\"https:\/\/www.askpython.com\/python\/built-in-methods\/open-files-in-python\" class=\"rank-math-link\">read binary mode<\/a> using the following line of code:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npdf = open(&#039;sample_pdf.pdf&#039;, &#039;rb&#039;)\n<\/pre><\/div>\n\n\n<p>This opens the PDF and stores the resulting binary <strong>file object<\/strong> in the variable &#8216;<em>pdf<\/em>&#8217;. The PdfFileReader object is created from it in the next step.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. 
Use PyPDF2.PdfFileReader() to read text<\/h3>\n\n\n\n<p>Now you can pass the open file to the <strong>PdfFileReader<\/strong> class from PyPDF2 to read it. <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npdfReader = PyPDF2.PdfFileReader(pdf)\n<\/pre><\/div>\n\n\n<p>To get the text from the first page of the PDF, use the following lines of code:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npage_one = pdfReader.getPage(0)\nprint(page_one.extractText())\n<\/pre><\/div>\n\n\n<p>We get the output as:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nHello World. \n!This is a sample PDF with 2 pages. !This is the first page. !\n<\/pre><\/div>\n\n\n<p>Here we used the getPage() method to fetch the page as an object. Page numbers start at 0, so getPage(0) returns the first page. Then we used the extractText() method to get the text from the page object. <\/p>\n\n\n\n<p>The text we get is a Python string (<strong>str<\/strong>). <\/p>\n\n\n\n<p>PdfFileReader, getPage() and extractText() are the names used by PyPDF2 releases before 3.0. From version 3.0 onwards they are replaced by PdfReader, the pages attribute of the reader, and extract_text().<\/p>\n\n\n\n<p>Similarly, to get the text from the second page of the PDF, use:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npage_two = pdfReader.getPage(1)\nprint(page_two.extractText())\n<\/pre><\/div>\n\n\n<p>We get the output as:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nThis is the text on Page 2. 
\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">Complete Code to Read PDF Text using PyPDF2<\/h3>\n\n\n\n<p>The complete code from this section is given below:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport PyPDF2\n# Open the PDF in read-binary mode and wrap it in a reader\npdf = open(&#039;sample_pdf.pdf&#039;, &#039;rb&#039;)\npdfReader = PyPDF2.PdfFileReader(pdf)\n# Pages are zero-indexed, so 0 is the first page\npage_one = pdfReader.getPage(0)\nprint(page_one.extractText())\n# Close the file once the text has been extracted\npdf.close()\n<\/pre><\/div>\n\n\n<p>Notice that the formatting of the first page is a little off in the output above: stray &#8216;!&#8217; characters appear and the lines run together. This is because PyPDF2 does not always reconstruct the layout of a page accurately when extracting text. <\/p>\n\n\n\n<p>Luckily, Python has a better alternative to PyPDF2. We are going to look at that next. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Using PDFplumber to Extract Text<\/h2>\n\n\n\n<p><strong>PDFplumber<\/strong> is another tool that can extract text from a PDF. It is built on pdfminer.six and, in addition to plain text, can extract tables and detailed layout information.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. Install the package<\/h3>\n\n\n\n<p>Let&#8217;s get started by installing PDFplumber. <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\npip install pdfplumber\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"604\" height=\"403\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/pdfplumber.png\" alt=\"Pdfplumber\" class=\"wp-image-9244\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/pdfplumber.png 604w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/pdfplumber-300x200.png 300w\" sizes=\"auto, (max-width: 604px) 100vw, 604px\" \/><figcaption>Pdfplumber<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2. 
Import pdfplumber<\/h3>\n\n\n\n<p>Start by importing PDFplumber using the following line of code:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pdfplumber\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\">3. Using PDFplumber to read PDFs<\/h3>\n\n\n\n<p>You can start reading PDFs using PDFplumber with the following piece of code:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nwith pdfplumber.open(&quot;sample_pdf.pdf&quot;) as pdf:\n    first_page = pdf.pages&#x5B;0]\n    print(first_page.extract_text())\n<\/pre><\/div>\n\n\n<p>This will get the text from the first page of our PDF. The output is:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nHello World. \n\nThis is a sample PDF with 2 pages. \n\nThis is the \ufb01rst page. \n<\/pre><\/div>\n\n\n<p>Compare this with the PyPDF2 output: PDFplumber keeps each sentence on its own line and adds no stray characters. <\/p>\n\n\n\n<p>PDFplumber also provides options to get other information from the PDF. <\/p>\n\n\n\n<p>For example, you can use <strong>.page_number<\/strong> to get the page number, which starts at 1. <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nprint(first_page.page_number)\n<\/pre><\/div>\n\n\n<p>Output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n1\n<\/pre><\/div>\n\n\n<p>To learn more about the methods PDFplumber provides, refer to its official <a class=\"rank-math-link\" href=\"https:\/\/github.com\/jsvine\/pdfplumber\" target=\"_blank\" rel=\"noopener\">documentation.<\/a> <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion <\/h2>\n\n\n\n<p>This tutorial was about reading text from PDFs. 
We looked at two different tools and saw how one is better than the other. <\/p>\n\n\n\n<p>Now that you know how to read text from a PDF, you should read our tutorial on <a class=\"rank-math-link\" href=\"https:\/\/www.askpython.com\/python-modules\/tokenization-in-python-using-nltk\">tokenization<\/a> to get started with Natural Language Processing!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>PDFs are a common way to share text. PDF stands for Portable Document Format and uses the .pdf file extension. It was created in the early 1990s by Adobe Systems. Reading PDF documents using python can help you automate a wide variety of tasks. In this tutorial we will learn how to extract text from [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":9246,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-9242","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/9242","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=9242"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/9242\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/9246"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=9242"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=9242"},{"ta
xonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=9242"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}