simbab wrote in linux

Extract text from any document

I am working on a final project for a database systems course. It is essentially a clone of Beagle, except it's oriented towards a collection of optical data discs or other removable media instead of a live index of the user's hard disk.

The project is essentially done, and right now I am ironing out the bugs and making it (more) presentable. As it stands, it has a modular "filter" system that picks a filter module based on the file extension (the infrastructure exists to do this by MIME type as well, but it is not used). Since supporting a multitude of file formats isn't really the point of this project, it has just two such filter modules: one for PDF and another for Word (OLE).
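To make the lookup concrete, here is a rough sketch in Python of what I mean by choosing a filter from the file extension. The names `FILTERS` and `find_filter` and the exact extensions are made up for illustration; this is not the project's actual code.

```python
import os

# Hypothetical registry: lowercase file extension -> filter name.
# The real project has one PDF filter and one Word (OLE) filter.
FILTERS = {
    ".pdf": "pdf",
    ".doc": "word",
}


def find_filter(path):
    """Return the filter name registered for the path's extension, or None."""
    ext = os.path.splitext(path)[1].lower()
    return FILTERS.get(ext)
```

A file with no registered extension simply yields `None`, so it can be skipped or indexed by name only.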

Basically, all these modules do is delegate to some command-line utility and then read its standard output. In the case of PDF, that utility is pdftotext. In the Word doc case, it actually uses the beagle-doc-extractor tool from Beagle itself. Some massaging of the output is done before it's fed into the database.
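Roughly, the delegation looks like the sketch below (Python, illustrative only). The helper names `pdf_command`, `run_extractor` and `massage` are hypothetical, and the whitespace clean-up is just an assumed stand-in for the massaging step. The one real detail is that pdftotext writes to standard output when the output file is given as `-`.

```python
import subprocess


def pdf_command(path):
    # pdftotext sends the extracted text to stdout when the output file is "-".
    return ["pdftotext", path, "-"]


def run_extractor(argv):
    """Run an external extractor and return its standard output as text."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def massage(text):
    # Assumed clean-up step: collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

With that in place, a PDF filter amounts to `massage(run_extractor(pdf_command(path)))`, and any other format only needs its own command line.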

But I was curious if there was some way to extract the text from "any" (for reasonable values of "any") document, similar to the way the GNU a2ps utility formats lots of different kinds of files into PostScript. If I can get a lot of text-like files in one fell swoop, that would be awesome. It's not a huge part of the project but it would be a nice plus.

Clues, anyone?