ruscur's blog

Picture me, if you would. Sitting at my desk, looking at my screen, puzzled. I have several directories of hardware manuals, and I'm looking for something. Where is it? I have no idea. I just want to know where the documentation for one thing lives, in a sea of poorly named .pdf files. How do I solve this predicament?

Let's learn about pdfgrep and ripgrep-all, together.

So if you wanted to grep PDFs, you would probably do something sensible like walk every file, convert each one to text (e.g. with pdftotext), and then grep that. Which is pretty much what these tools do.
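To make that concrete, here's a rough sketch of the do-it-yourself version. It assumes pdftotext (from poppler) is on your PATH, and the function name naive_pdfgrep is made up for this post. It's the slow baseline these tools improve on, not how either of them is actually implemented.

```shell
# Naive PDF grep: walk a directory, convert each PDF to text, grep the text.
# Usage: naive_pdfgrep PATTERN [DIR]
# "naive_pdfgrep" is a hypothetical helper name, not an existing tool.
naive_pdfgrep() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        pattern=$1; shift
        for pdf in "$@"; do
            # "-" tells pdftotext to write the extracted text to stdout.
            pdftotext "$pdf" - 2>/dev/null | grep -i -- "$pattern" |
                while IFS= read -r line; do
                    # Prefix every matching line with the file it came from.
                    printf "%s: %s\n" "$pdf" "$line"
                done
        done
    ' sh "$pattern" {} +
}
```

Something like `naive_pdfgrep keyword ~/manuals` would then search everything under that directory, one file at a time, reconverting every PDF on every run.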

So I mentioned two tools, pdfgrep and ripgrep-all. Which of these should you use? First let's check out ripgrep-all.

If you don't use ripgrep then you're missing out. It's the fastest grepper in the west and it's written in Rust, which is always a plus. Check out ripgrep here.

ripgrep-all, or rga, takes that speedy grepping and adapts it to a great variety of file types. rga is essentially a preprocessor for ripgrep. You can now grep for subtitles in video files, text in .pdf, .odt and .docx files, crazy stuff like SQLite database entries, and all of these things nested inside various archive formats. If you're absolutely insane, you can also get it to read characters from images, and convert PDF -> PNG -> text for any file that's being tricky.

Get ripgrep-all here.

rga is all well and good, but if you have a huge number of PDF files, you don't want to convert them to text every single time. Well, rga actually caches everything by default, so you should get a massive speedup the second time around. Here's a before and after:

rga keyword 4404.57s user 179.05s system 614% cpu 12:25.80 total
rga keyword 1361.33s user 61.12s system 511% cpu 4:37.85 total

That's slightly misleading though, because rga walks a bunch of archives in my path. We can make it only care about PDFs for this comparison with --rga-adapters=poppler (poppler provides pdftotext) and get the following:

rga --rga-adapters=poppler keyword # no cache
2166.33s user 66.96s system 741% cpu 5:01.06 total
rga --rga-adapters=poppler keyword # cache
6.32s user 4.44s system 570% cpu 1.884 total

Wow. That's pretty fast. If you wanted to use rga for all your grepping needs but don't want it to go into archive files, you could alias it to rga --rga-adapters=-zip,decompress. See the documentation for more details.
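If you'd rather use a shell function than an alias, a minimal sketch might look like this. The name rgn is invented for this example; the adapter list is the one mentioned above.

```shell
# Hypothetical wrapper name; "rgn" is not something rga ships.
rgn() {
    # The leading "-" means: keep the default adapters, minus the listed ones.
    rga --rga-adapters=-zip,decompress "$@"
}
```

Everything after the wrapper name is passed straight through, so `rgn -i keyword ~/manuals` behaves like a normal rga call that skips those two adapters.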

Alternatively, there's pdfgrep, a tool specifically designed to grep through PDFs. Much like rga, it extracts text with poppler (the library behind pdftotext) and can cache the results, but unlike rga it doesn't cache by default: you have to pass it --cache.

The key difference is not only the grepping engine (which I can't imagine is faster than ripgrep), but also that rga compresses its cache with zstd before storing it in LMDB, so it uses much less space. pdfgrep's cache is just plain text.
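If you're wondering what a text cache actually buys you, here's a toy version of the idea. To be clear, this is not how rga or pdfgrep implement their caches: the cache directory, the PDFTEXT_CACHE_DIR variable, the function name and the cksum-based key are all assumptions made up for this sketch, and the cached text is stored uncompressed.

```shell
# Hypothetical cache location; neither rga nor pdfgrep uses this path.
CACHE_DIR=${PDFTEXT_CACHE_DIR:-$HOME/.cache/pdftext-sketch}

# Print the text of a PDF, converting it only if it hasn't been seen before.
# Usage: cached_pdftotext FILE.pdf
cached_pdftotext() {
    pdf=$1
    mkdir -p "$CACHE_DIR" || return 1
    # cksum prints "<crc> <size>" for stdin: a weak but simple content key.
    key=$(cksum < "$pdf" | tr ' ' '-')
    cached="$CACHE_DIR/$key.txt"
    if [ ! -f "$cached" ]; then
        # Cache miss: convert once, then move into place so a half-written
        # file is never mistaken for a finished one.
        pdftotext "$pdf" "$cached.tmp" && mv "$cached.tmp" "$cached" || return 1
    fi
    cat "$cached"
}
```

The first call for a given file pays for the conversion; later calls only pay for a checksum and a cat. Piping the text through a compressor before writing it is roughly where the space saving would come from.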

Let's see how pdfgrep does with and without caching.

pdfgrep -ri keyword^C # I got bored of waiting

Yeah... pdfgrep only uses one thread. There's a suggestion in the docs to use GNU parallel, so we can try that:

find . -name "*.pdf" -print0 | parallel -q0 pdfgrep -H -i keyword
4496.81s user 110.18s system 739% cpu 10:23.22 total

And again after building the cache:

67.34s user 35.45s system 669% cpu 15.354 total
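If you don't have GNU parallel installed, a similar fan-out can be sketched with xargs. This is not the command I timed above, just an assumed equivalent: it relies on the -0 and -P options found in GNU and BSD xargs, the function name is made up, and the batch size of 8 is arbitrary.

```shell
# Run pdfgrep over every PDF under DIR, several processes at a time.
# Usage: pdfgrep_parallel PATTERN [DIR]
# "pdfgrep_parallel" is a hypothetical helper name.
pdfgrep_parallel() {
    pattern=$1
    dir=${2:-.}
    # One worker per online CPU, falling back to 4 if getconf can't tell us.
    njobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
    find "$dir" -type f -name '*.pdf' -print0 |
        # Hand each pdfgrep up to 8 files; --cache reuses text from earlier runs.
        xargs -0 -P "$njobs" -n 8 pdfgrep --cache -H -i "$pattern"
}
```

Output from different workers can interleave, which is fine for a quick search but worth knowing before you pipe it anywhere.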

OK, it's clear pdfgrep is no competition. On top of that, since pdfgrep doesn't compress its cache, it takes up more space:

du -h -d0 ~/.cache/pdfgrep: 1.2G
du -h -d0 ~/.cache/rga: 177M

The simplest option here is the best. Install ripgrep-all, alias it to grep if you want. Use it everywhere, for everything. It's fast, it automatically caches, and it's very configurable; you can tune its cache compression level and its archive recursion depth.
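As a sketch of what that tuning could look like, here's a small wrapper that sets both knobs. The function name and the two environment variables are invented for this example, and the fallback values are illustrative picks rather than recommendations; check rga --help for the ranges your version accepts.

```shell
# Hypothetical wrapper; RGA_ZSTD_LEVEL and RGA_MAX_DEPTH are made-up variables.
rga_tuned() {
    # Higher compression level: smaller cache, slower first run.
    # Lower recursion depth: less digging into archives inside archives.
    rga --rga-cache-compression-level="${RGA_ZSTD_LEVEL:-12}" \
        --rga-max-archive-recursion="${RGA_MAX_DEPTH:-2}" \
        "$@"
}
```

Running `RGA_ZSTD_LEVEL=19 rga_tuned keyword` would then trade a slower first pass for a smaller cache on disk.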

Grepping PDFs like a pro