pdfsummary
Python script that creates a text summary of a PDF file.
We were inspired by the overwhelming amount of assigned reading that students receive, and hoped to reduce the workload with this project. We learned how to use packages installed with pip, how to use GitHub, and how to work with PDF files in Python. Our biggest problem was tuning the parameters of our program so that it extracted the text in the PDFs more accurately.
Installation
To install pdfsummary from GitHub:
git clone https://github.com/archen2019/pdfsummary
pdfsummary also needs the dependencies listed in requirements.txt. To install, run the following:
pip install -r requirements.txt
tesseract is also a required dependency. To install tesseract, follow the instructions at https://github.com/tesseract-ocr/tesseract/wiki.
Usage
To run pdfsummary, enter the cloned folder and run run.py. When prompted, enter the file name of the PDF, the number of sentences in the summary, and the number of key phrases.
$ cd pdfsummary
$ python run.py
File Name: [FILE-NAME]
Number of sentences in summary: [NUM-SENTENCES]
Number of key phrases: [NUM-PHRASES]
This will create 4 files in the directory of the original PDF:
- keyphrases.txt: A text file containing the key phrases.
- summary.txt: A text file containing the summary.
- Summary.pdf: A PDF file containing the key phrases and the summary.
- highlighted.pdf: A PDF file containing the original PDF, with key phrases highlighted.
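As a quick check after a run, the four output names listed above can be looked up next to the input PDF. This is a minimal sketch, not part of pdfsummary itself: the helper names `expected_outputs` and `missing_outputs` are hypothetical, and it assumes the files are written with exactly the names given in this README.

```python
from pathlib import Path

# Output file names as listed in this README (assumed to be exact).
OUTPUT_NAMES = ("keyphrases.txt", "summary.txt", "Summary.pdf", "highlighted.pdf")


def expected_outputs(pdf_path):
    """Return the paths where the four output files should appear.

    Hypothetical helper: the outputs are assumed to sit in the same
    directory as the original PDF, as described above.
    """
    folder = Path(pdf_path).resolve().parent
    return [folder / name for name in OUTPUT_NAMES]


def missing_outputs(pdf_path):
    """Return the expected output paths that do not exist yet."""
    return [path for path in expected_outputs(pdf_path) if not path.exists()]
```

Calling `missing_outputs("reading.pdf")` after a run returns an empty list if all four files were created, which may be handy when processing several PDFs in a row.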
Methodology
- Use pdf2image to convert the PDF into PNG images.
- Use tesseract to extract text from the images and create a text-searchable copy of the original PDF.
- Process the text to remove extra newlines and reconnect hyphenated words.
- Use sumy to generate a summary of the processed text.
- Use pke to generate key phrases from the processed text.
- Create a PDF containing the key phrases and the summary.
- Highlight the key phrases in the text-searchable PDF.
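The text-processing step (removing extra newlines and reconnecting hyphenated words) can be sketched with the standard `re` module. This is an illustrative sketch under our own assumptions, not the project's actual code: the function name `clean_ocr_text` is hypothetical, and it assumes that a blank line marks a paragraph break while a single newline is only a line wrap introduced by OCR.

```python
import re


def clean_ocr_text(raw):
    """Tidy OCR output before summarization (hypothetical helper).

    Assumptions: blank lines separate paragraphs, and single newlines
    are line wraps that should become spaces.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Reconnect words split across a line break: "read-\ning" -> "reading".
    # Caveat: a real compound such as "well-\nknown" also loses its hyphen.
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)

    cleaned = []
    # Split on blank lines so paragraph breaks survive the cleanup.
    for paragraph in re.split(r"\n\s*\n", text):
        # Turn the remaining line wraps into single spaces.
        paragraph = re.sub(r"\s*\n\s*", " ", paragraph)
        # Collapse runs of spaces or tabs left over from the page layout.
        paragraph = re.sub(r"[ \t]{2,}", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)
```

Rejoining hyphenated words before replacing newlines matters here: once the line break has become a space, "read- ing" can no longer be told apart from an ordinary dash followed by a word.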
Citations
Boudin, Florian. “Pke: An Open Source Python-Based Keyphrase Extraction Toolkit.” Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, The COLING 2016 Organizing Committee, 2016, pp. 69–73. ACLWeb, https://www.aclweb.org/anthology/C16-2015.