Skip to content

π Utilities to split PDF files into smaller files, generate thumbnails and extract textual content.

License

Notifications You must be signed in to change notification settings

pixelastic/pietro

Repository files navigation

π Pietro

Set of utilities to deal with PDF files: splitting into individual pages, extracting text and images.

This modules relies on CLI binaries (qpdf, imagemagick, poppler-utils, etc), so it provide good interoperability, it runs every command from inside a Docker container.

It will be slower than other implementation, but it is portable.

This package has been created out of the need to centralize common pattern of PDF handling I needed for various projects. The way it names files and folders makes sense to me, they might not suit you.

import { extractPages, extractImages, extractText } from 'pietro';


// Extract each page of the pdf into its own file
await extractPages('./bigFile.pdf', './pages');
// ./pages/
// ./pages/001.pdf
// ./pages/...
// ./pages/401.pdf

// Extract the text of a given pdf into a txt file
await extractText('./pages/001.pdf', './text/001.txt');
// ./text/
// ./text/001.txt

// Extract all images from a given pdf file.
// ./raw contains ALL images, including presentational ones and masks
// ./illustrations only contains "real" illustration images
await extractImages('./pages/001.pdf', ./images');
// ./images/
// ./images/001/
// ./images/001/raw/
// ./images/001/raw/001.png
// ./images/001/raw/...
// ./images/001/raw/227.png
// ./images/001/illustrations/
// ./images/001/illustrations/1.png
// ./images/001/illustrations/...
// ./images/001/illustrations/5.png

About

π Utilities to split PDF files into smaller files, generate thumbnails and extract textual content.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •