Search tool for PDF content with verbatim text including special characters

+3
−0

I'm looking for a free (and, if possible, open-source) tool to search through the content of PDFs.

Requirements:

  • search full text of all PDFs in one folder (the PDFs are "plain text", not scans)
  • show search result in the context of the lines around it
  • allow special characters in search like \begin{frame}<1-> or \defbeamertemplate* without having to escape them. I'm looking for exact matches and don't need fuzzy search etc.
  • works on macOS 15

So far I've tried

DocFetcher

✅ search full text of all PDFs in one folder. Search index needs to be manually updated

✅ show search result in the context of the lines around it. Shows the full context including line breaks.

❌ allow special characters in search like \begin{frame}<1-> or \defbeamertemplate* without having to escape them

  • searching for \begin{frame}<1-> will cause an error; searching for "\begin{frame}<1->" will find false results like begin{frame} $1
  • searching for \defbeamertemplate* will give false results like \defbeamertemplate{block}

✅ works on macOS 15

Recoll

✅ search full text of all PDFs in one folder. Update of search index can be automated, e.g. with a cron job

✅ show search result in the context of the lines around it. Shows unformatted context; line breaks are missing

❌ allow special characters in search like \begin{frame}<1-> or \defbeamertemplate* without having to escape them

  • searching for \begin{frame}<1-> will find false results like \begin{frame} 1
  • searching for \defbeamertemplate* will give false results like \defbeamertemplate{block}

✅ works on macOS 15

rga

✅ search full text of all PDFs in one folder.

❌ show search result in the context of the lines around it. Only shows one line

❌ allow special characters in search like \begin{frame}<1-> or \defbeamertemplate* without having to escape them

  • running rga \begin{frame}<1-> /path/to/my/folder does not give any matches; I don't know how I would need to escape this
  • running rga defbeamertemplate\* /path/to/my/folder will give false results like \defbeamertemplate{block}

✅ works on macOS 15

2 answers

+2
−0

The big sticking point (you obviously know that, since you kept running into the same issue of needing to escape certain characters) is that most of these tools run the search string through their own regular expression (or similar) engine. Poking around for something where that could be turned off, I found the uninspiringly but usefully named pdfgrep.

At first, it doesn't look like it'll work, but digging through the full man page, we find four options relevant to the question.

   -F, --fixed-strings
       Interpret PATTERN as a list of fixed strings separated by newlines, any of
       which is to be matched.

   -A NUM, --after-context=NUM
       Print NUM lines of context after matching lines. Contiguous groups of matches
       are separated by a line containing --. With -o, this option has no effect.

   -B NUM, --before-context=NUM
       Print NUM lines of context before matching lines. Contiguous groups of
       matches are separated by a line containing --. With -o, this option has no
       effect.

   -C NUM, --context=NUM
       Print NUM lines of context before and after matching lines. Contiguous groups
       of matches are separated by a line containing --. With -o, this option has no
       effect.

Then, at least in the Bash shell, single-quote the search string, and that should do it.

pdfgrep --fixed-strings --context=2 '\begin{frame}<1->' target-file.pdf

Adjust the context value to taste. You'll still need to escape any apostrophes in the search string, sadly, since an unescaped one would end the quoting early. But at least that's predictable and has a recognizable failure mode: a stray apostrophe leaves the quote open, so the shell waits for more input instead of running the command when you hit Enter or Return.
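
For the apostrophe case, here is a minimal sketch of the usual workaround: close the single quotes, add a backslash-escaped apostrophe, and reopen them. The search string \title{Alice's talk} and the variable name pattern are made up for illustration; they are not from the question.

```shell
# Hypothetical search string: \title{Alice's talk}
# '\'' = end the quoted part, add one literal apostrophe, start quoting again.
pattern='\title{Alice'\''s talk}'

# Show exactly what the shell will hand to the search tool.
printf '%s\n' "$pattern"

# Double-quoting the variable passes the string on unchanged, for example:
# pdfgrep --fixed-strings --context=2 "$pattern" target-file.pdf
```

Printing the variable first is a cheap way to check that the quoting came out as intended before running the actual search.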

However, I found one bigger caveat while playing around: the target string needs to appear on a single line in the document. I happen to have a PDF file handy that I generated from Markdown files, and it breaks words across lines when necessary; pdfgrep can't find those split words, because the two halves technically sit in separate text boxes, which is how most PDF output routines work.
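
One possible workaround for the line-break caveat, as a rough sketch rather than a tested recipe: extract the text, join the wrapped lines, and search the result as a fixed string. The helper name unwrap is made up here, and the commented-out last line assumes that pdftotext (from poppler) is installed, which the question does not mention.

```shell
# Join wrapped lines so a string split across a line break can still be found.
# Assumption: a trailing hyphen is a hyphenation mark and is dropped. That is
# wrong for text that really ends in "-", such as \begin{frame}<1- at a line end.
unwrap() {
  awk '{ if (sub(/-$/, "")) printf "%s", $0; else printf "%s ", $0 } END { print "" }'
}

# Demo on stand-in text with a word broken across two lines.
printf 'see \\defbeamer-\ntemplate* here\n' | unwrap | grep -F -c '\defbeamertemplate*'

# With a real file, assuming pdftotext is available:
# pdftotext target-file.pdf - | unwrap | grep -F -c '\begin{frame}<1->'
```

This only reports whether the string occurs at all. The line context is lost, because the joined text is one long line, so it is more of a second pass for strings that pdfgrep missed than a replacement for it.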

+2
−0

@JohnC's great answer made me realise that rga can actually use the same options (it passes them on to ripgrep):

rga --fixed-strings --context=2 '\begin{frame}<' /path/to/my/folder

which makes it

  • show as many additional lines as I would like
  • use a fixed search expression, which eliminates almost all problems with special characters (see @JohnC's answer for the exceptions)
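
To illustrate why the fixed-string option helps with \defbeamertemplate*, here is a small sketch that uses plain grep on two sample lines as a stand-in for rga on a PDF. The sample text is made up, and the pattern omits the leading backslash, like the rga attempt in the question (where the shell turns the unquoted \* into a plain *).

```shell
# Two sample lines standing in for text extracted from a PDF.
sample='\defbeamertemplate{block}
\defbeamertemplate*{block}'

# As a regular expression, "e*" means "zero or more e", so both lines match.
printf '%s\n' "$sample" | grep -c 'defbeamertemplate*'

# As a fixed string, the asterisk is literal, so only the starred line matches.
printf '%s\n' "$sample" | grep -F -c 'defbeamertemplate*'
```

The first count covers both lines and the second only the starred one, which matches the false results I saw before adding --fixed-strings.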