Grepping PDFs like a pro
Picture me, if you would. Sitting at my desk, looking at my screen, puzzled. I
have several directories of hardware manuals, and I'm looking for something.
Where is it? I have no idea. I just want to know where the documentation for
one thing lives, in a sea of poorly named .pdf files. How do I solve this
predicament?
Let's learn about pdfgrep and ripgrep-all, together.
So if you wanted to grep PDFs, you would probably do something sensible like
walk every file, convert each one to text (e.g. with pdftotext), and then grep
that. Which is pretty much what these tools do.
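If you're curious, the naive version fits in a few lines of shell. This is only
a rough sketch of the idea, not how either tool is actually implemented.
naive_pdfgrep and the PDF_TO_TEXT override are names I made up for this
example, and it assumes poppler's pdftotext is installed and that your grep
supports --label (GNU grep does).

```shell
#!/usr/bin/env bash
# Naive PDF grep: convert each PDF to text, then grep the text.
# PDF_TO_TEXT is a hypothetical override hook; it defaults to poppler's pdftotext.
PDF_TO_TEXT="${PDF_TO_TEXT:-pdftotext}"

naive_pdfgrep() {
    local pattern="$1" dir="${2:-.}"
    # -print0 / read -d '' keeps file names with spaces intact.
    find "$dir" -type f -iname '*.pdf' -print0 |
        while IFS= read -r -d '' pdf; do
            # "pdftotext FILE -" writes the extracted text to stdout.
            # --label makes grep report the PDF's path instead of "(standard input)".
            "$PDF_TO_TEXT" "$pdf" - 2>/dev/null |
                grep -H --label="$pdf" -i -n -- "$pattern"
        done
}
```

Something like naive_pdfgrep 'reset pin' ~/manuals would then print
path:line:match for every hit. It re-converts every PDF on every run, which is
exactly the cost the rest of this post is about.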
So I mentioned two tools, pdfgrep and ripgrep-all. Which of these should you
use? First let's check out ripgrep-all.
If you don't use ripgrep then you're missing out: it's the fastest grepper in
the west and it's written in Rust, which is always a plus. Check out ripgrep
here.
ripgrep-all, or rga, takes that speedy grepping and adapts it to also work on a
great variety of file types. rga is essentially a preprocessor for ripgrep. You
can now grep for subtitles in video files, text in .pdf, .odt, and .docx files,
crazy stuff like SQLite database entries, and all of these things combined
inside of various archive formats. If you're absolutely insane, you can also
get it to read characters from images, and convert PDF -> PNG -> text for any
file that's being tricky.
Get ripgrep-all here.
rga is all well and good, but if you have a huge number of PDF files, you don't
want to have to convert them to text every single time. Well, rga actually
caches everything by default, so you should see a massive speedup the second
time around. Here's a before and after:
rga keyword 4404.57s user 179.05s system 614% cpu 12:25.80 total
rga keyword 1361.33s user 61.12s system 511% cpu 4:37.85 total
That's slightly misleading though, because rga walks a bunch of archives in my
path. We can make it only care about PDFs for this comparison with
--rga-adapters=poppler (poppler provides pdftotext) and get the following:
rga --rga-adapters=poppler keyword # no cache
2166.33s user 66.96s system 741% cpu 5:01.06 total
rga --rga-adapters=poppler keyword # cache
6.32s user 4.44s system 570% cpu 1.884 total
Wow. That's pretty fast. If you want to use rga for all your grepping needs but
don't want it to go into archive files, you could alias it to
rga --rga-adapters=-zip,decompress. See the documentation for more details.
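In a shell rc file that might look something like the following. The first
alias is the one described above; rgpdf is a hypothetical name I picked for a
PDF-only variant that reuses the --rga-adapters=poppler flag from the timing
comparison.

```shell
# e.g. in ~/.bashrc or ~/.zshrc

# Use every default adapter except the archive ones.
alias rga='rga --rga-adapters=-zip,decompress'

# Hypothetical helper: only look inside PDFs (poppler adapter).
alias rgpdf='rga --rga-adapters=poppler'
```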
Alternatively, there's pdfgrep, a tool specifically designed to grep through
PDFs. Very similarly, it extracts text with poppler (the library behind
pdftotext) and can cache results, but unlike rga it doesn't do this by default;
you have to pass it --cache.
The key difference is not only the different grepping engine (which I can't
imagine is faster than ripgrep), but also that rga compresses its cached text
with zstd before storing it in an LMDB database, making it use much less space.
pdfgrep's cache just uses plaintext.
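To make the caching idea concrete, here's a toy version of a compressed text
cache. It is a sketch under my own assumptions, not how rga or pdfgrep store
anything: cached_pdf_text, CACHE_DIR, and PDF_TO_TEXT are made-up names, the
cache key is a plain cksum of the file contents, and gzip stands in for
whatever compression the real tools use.

```shell
#!/usr/bin/env bash
# Toy compressed cache for extracted PDF text.
PDF_TO_TEXT="${PDF_TO_TEXT:-pdftotext}"                  # hypothetical override hook
CACHE_DIR="${CACHE_DIR:-${TMPDIR:-/tmp}/pdftext-cache}"  # hypothetical location

cached_pdf_text() {
    local pdf="$1" key entry
    mkdir -p "$CACHE_DIR"
    # Key on the file's contents (CRC and size), so a renamed PDF still hits.
    key=$(cksum < "$pdf" | tr ' ' '-')
    entry="$CACHE_DIR/$key.gz"
    if [ ! -f "$entry" ]; then
        # Cache miss: do the slow conversion once and store it compressed.
        # Write to a temp name first so a half-written entry is never used.
        "$PDF_TO_TEXT" "$pdf" - 2>/dev/null | gzip -c > "$entry.tmp" &&
            mv "$entry.tmp" "$entry"
    fi
    # Hit or freshly filled: decompress the cached text to stdout.
    gzip -dc "$entry"
}
```

cached_pdf_text manual.pdf | grep -i keyword pays the conversion cost on the
first call only. Dropping the gzip steps gives you the plaintext flavour of the
same cache, at the price of more disk.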
Let's see how pdfgrep does with and without caching.
pdfgrep -ri keyword^C # I got bored of waiting
Yeah... pdfgrep only uses one thread. There's a suggestion to use parallel in
the docs, so we can try that:
find . -name "*.pdf" -print0 | parallel -q0 pdfgrep -H -i keyword
4496.81s user 110.18s system 739% cpu 10:23.22 total
And again after building the cache:
67.34s user 35.45s system 669% cpu 15.354 total
OK, it's clear pdfgrep is no competition. On top of that, since pdfgrep doesn't
compress its cache, it takes up more space:
du -h -d0 ~/.cache/pdfgrep: 1.2G
du -h -d0 ~/.cache/rga: 177M
The simplest option here is the best. Install ripgrep-all, alias it to grep if
you want. Use it everywhere, for everything. It's fast, it automatically
caches, and it's very configurable; you can tune its cache compression level
and its archive recursion depth.
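As a rough example of that tuning, here's a small wrapper. The two flags are
the ones I believe rga documents for these settings, but check rga --help for
your version; the values are arbitrary picks rather than recommendations, and
rga_tuned and RGA_BIN are names I invented for the sketch.

```shell
#!/usr/bin/env bash
# RGA_BIN is a hypothetical override so the wrapper can point at another binary.
RGA_BIN="${RGA_BIN:-rga}"

rga_tuned() {
    # Higher compression level: smaller cache, slower first run.
    # Lower recursion limit: don't dig through deeply nested archives.
    "$RGA_BIN" \
        --rga-cache-compression-level=19 \
        --rga-max-archive-recursion=2 \
        "$@"
}
```

Call it like the real thing, e.g. rga_tuned keyword ~/manuals.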