Picture me, if you would. Sitting at my desk, looking at my screen, puzzled. I
have several directories of hardware manuals, and I'm looking for something.
Where is it? I have no idea. I just want to know where the documentation for
one thing lives, in a sea of poorly named PDFs.
Let's learn about pdfgrep and ripgrep-all.
So if you wanted to grep PDFs, you would probably do something sensible like
walk every file, convert each one to text (e.g. with pdftotext), and then grep
that.
Which is pretty much what these tools do.
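The walk, convert, grep pipeline described above can be sketched in a few lines of shell. This is only an illustration of the idea, not how either tool is implemented: `naive_pdf_grep` and `to_text` are names made up for this sketch, and it assumes `pdftotext` (from poppler-utils) is installed and that your grep supports `--label` (GNU and BSD grep do).

```shell
# Conversion step, kept separate so a different converter can be swapped in.
# "pdftotext FILE -" writes the extracted text to stdout.
to_text() {
    pdftotext "$1" - 2>/dev/null
}

# Hypothetical helper: grep every PDF under a directory, case-insensitively.
# Usage: naive_pdf_grep PATTERN [DIR]
naive_pdf_grep() {
    pattern=$1
    dir=${2:-.}
    # Walk the tree; one path per line, so names containing newlines break this.
    find "$dir" -type f -name '*.pdf' -print | while IFS= read -r f; do
        # --label makes grep report the PDF's path instead of "(standard input)".
        to_text "$f" | grep -i -H --label="$f" -e "$pattern"
    done
}
```

Something like `naive_pdf_grep keyword ~/manuals` would then print matches as `path:line`. Every run converts every PDF again, which is exactly the cost the caching discussed later avoids.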
So I mentioned two tools, pdfgrep and ripgrep-all. Which of these should
you use? First let's check out ripgrep-all (rga).
If you don't use
ripgrep then you're missing out: it's the fastest grepper in
the west and it's written in Rust, which is always a plus.
rga takes that speedy grepping and adapts it to also work on
a great variety of file types.
rga is essentially a preprocessor for
ripgrep. You can now grep for subtitles in video files,
.docx files, crazy stuff like SQLite database entries,
and all of these things combined inside of various archive formats. If you're
absolutely insane, you can also get it to read characters from images, and
convert PDF -> PNG -> text for any file that's being tricky.
rga is all well and good, but if you have a huge number of PDF files, you
don't want to have to convert them to text every single time, do you? Well,
rga actually caches everything by default, so you should see a massive speedup
the second time around. Here's a before and after:
rga keyword 4404.57s user 179.05s system 614% cpu 12:25.80 total
rga keyword 1361.33s user 61.12s system 511% cpu 4:37.85 total
That's slightly misleading though, because
rga walks a bunch of archives in my
path. We can make it only care about PDFs for this comparison with
--rga-adapters=poppler (i.e. pdftotext) and get the following:
rga --rga-adapters=poppler keyword # no cache
2166.33s user 66.96s system 741% cpu 5:01.06 total
rga --rga-adapters=poppler keyword # cache
6.32s user 4.44s system 570% cpu 1.884 total
Wow. That's pretty fast. If you want to use
rga for all your grepping
needs but don't want it to go into archive files, you could alias it to
rga --rga-adapters=-zip,decompress. See the documentation for more details.
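As a small sketch of that setup, here is the same idea written as a shell function rather than an alias, since functions also work in scripts. The name `rga_noarchive` is made up for this example; the adapter list is the one from the text.

```shell
# Hypothetical wrapper: run rga with the zip and decompress adapters disabled,
# passing every other argument straight through.
rga_noarchive() {
    rga --rga-adapters=-zip,decompress "$@"
}
```

Dropped into your shell's startup file, `rga_noarchive keyword ~/manuals` behaves like plain rga with those two adapters switched off.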
Next up is pdfgrep, a tool specifically designed to grep through
PDFs. Very similarly, it calls
pdftotext and can cache results, but unlike
rga it doesn't do this by default; you have to pass it a flag to turn caching on.
The key difference is not only the different grepping engine (which I can't
imagine is faster than
ripgrep), but also that
rga compresses its cache, which makes
it use much less space.
pdfgrep's cache just uses plaintext.
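To make the caching idea concrete, here is a minimal sketch of a compressed text cache. It is not how rga or pdfgrep store their caches: the cache path, the `cached_text` and `to_text` names, the checksum-based key, and the use of gzip are all assumptions chosen to keep the example short.

```shell
# Hypothetical cache location; neither tool uses this path.
CACHE_DIR=${CACHE_DIR:-${TMPDIR:-/tmp}/pdf-text-cache}

# Conversion step; swap in any PDF-to-text converter.
to_text() {
    pdftotext "$1" - 2>/dev/null
}

# Print the text of a PDF, converting it only on a cache miss.
# Usage: cached_text FILE
cached_text() {
    file=$1
    # Key on a checksum of the contents, so an edited file misses the cache.
    key=$(cksum < "$file" | tr ' ' '-')
    entry=$CACHE_DIR/$key.gz
    if [ ! -f "$entry" ]; then
        mkdir -p "$CACHE_DIR" || return 1
        tmp=$CACHE_DIR/$key.tmp
        # Convert to a temporary file first so a failed run leaves no entry.
        to_text "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
        # gzip replaces $tmp with $tmp.gz; move that into its final place.
        gzip -f "$tmp" && mv "$tmp.gz" "$entry" || return 1
    fi
    # Hit or freshly filled: decompress the stored text to stdout.
    gzip -dc "$entry"
}
```

With this in place, `cached_text manual.pdf | grep -i keyword` pays for the conversion once and only decompresses on later runs. Storing the text without the gzip step would give a plaintext cache instead, trading disk space for slightly less work per lookup.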
Let's see how
pdfgrep does with and without caching.
pdfgrep -ri keyword^C # I got bored of waiting
pdfgrep only uses one thread. There's a suggestion to use parallel
in the docs, so we can try that:
find . -name "*.pdf" -print0 | parallel -q0 pdfgrep -H -i keyword
4496.81s user 110.18s system 739% cpu 10:23.22 total
And again after building the cache:
67.34s user 35.45s system 669% cpu 15.354 total
OK, it's clear
pdfgrep is no competition. On top of that, since pdfgrep
doesn't compress its cache, it takes up more space:
du -h -d0 ~/.cache/pdfgrep: 1.2G
du -h -d0 ~/.cache/rga: 177M
The simplest option here is the best. Install
ripgrep-all, and set up an alias for it
if you want. Use it everywhere, for everything. It's fast, it automatically
caches, and it's very configurable; you can tune its cache compression level and its
archive recursion depth.