pdfgrep Command in Linux
Last Updated: 21 Nov, 2022
grep is a powerful tool for searching text files for a pattern or regular expression, but it cannot search PDF files. That is where pdfgrep comes in: it is a simple command that searches PDF files for a regular expression. This article covers the pdfgrep command and its usage.
Syntax:
Usage: pdfgrep [OPTION]... PATTERN FILE...
Installation of pdfgrep command
pdfgrep is not pre-installed like grep, but it is available from the package repositories of most Linux distributions.
1. For Ubuntu/Debian:
sudo apt-get install pdfgrep
2. For CentOS/Fedora (on recent Fedora releases, use dnf in place of yum):
sudo yum install pdfgrep
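After installing, it can be useful to confirm that the command is actually on the PATH. The following is a minimal sketch; the variable name pdfgrep_status is only illustrative.

```shell
# Check whether pdfgrep is available before relying on it in a script.
if command -v pdfgrep >/dev/null 2>&1; then
    pdfgrep_status="installed"
    pdfgrep --version            # print the installed version
else
    pdfgrep_status="missing"
    echo "pdfgrep is not installed; install it with your package manager." >&2
fi
echo "pdfgrep: $pdfgrep_status"
```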
Working with pdfgrep:
The pdfgrep command is largely compatible with GNU grep and adds some PDF-specific options. If you are familiar with grep, most of the options will look familiar.
1. Basic Search
Let's do a basic search for the string "General Linux" in a PDF file.
Example:
pdfgrep "General Linux" intro-linux.pdf
Output:
2. Print filename
Use the --with-filename or -H option to display the PDF file name with each match, even when there is only one file to search.
Example:
pdfgrep -H dns intro-linux.pdf
Output:
The command prints the filename by default when there is more than one file to search (implies -H).
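To search every PDF in a directory, the file list can be passed in one go. The sketch below is one possible approach; the function name search_pdfs is hypothetical. It passes -H explicitly so the file name is printed even when the directory holds a single PDF.

```shell
# search_pdfs PATTERN DIR
# Case-insensitive search of every *.pdf file directly inside DIR.
search_pdfs() {
    pattern=$1
    dir=$2
    set -- "$dir"/*.pdf          # expand the glob into the positional parameters
    if [ ! -f "$1" ]; then       # an unmatched glob stays literal, so no file exists
        echo "no PDF files in $dir" >&2
        return 1
    fi
    pdfgrep -H -i "$pattern" "$@"
}
```

For example, `search_pdfs dns .` searches all PDF files in the current directory for dns.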
3. Case-insensitive search:
Use the --ignore-case or -i option to do a case-insensitive search. Let's search for the word dns.
Example:
pdfgrep -i dns intro-linux.pdf
Output:
The above output shows the matches for both dns and DNS.
4. Get the match count
Use the --count or -c option to see the number of matches.
Example:
pdfgrep -ic dns intro-linux.pdf
Output:
Thus, ignoring case, dns is mentioned 28 times.
5. Show the page number
Use the --page-number or -n option to show the page number. This option prefixes each match with the number of the page on which the pattern matched.
Example:
pdfgrep -in dns intro-linux.pdf
Output:
6. Show match-count per page
Use the --page-count or -p option to print the number of matches per page. This option implies the page number option (-n).
Example:
pdfgrep -ip dns intro-linux.pdf
Output:
The above output has the form 'page number: match count'. On page 53, dns appears once, while on page 169 it appears 5 times.
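The per-page counts can be added up to get a total for the whole file. The following is a rough sketch that assumes each line of -p output ends in the match count after a colon, as described above; the helper name sum_page_counts is hypothetical.

```shell
# sum_page_counts: read "page: count" lines on stdin and print the total.
# The count is taken as the last colon-separated field of each line.
sum_page_counts() {
    awk -F: '{ total += $NF } END { print total + 0 }'
}

# Sample lines in the 'page number: match count' form described above.
printf '53: 1\n169: 5\n' | sum_page_counts    # prints 6
```

With a real file, the same helper could be used as `pdfgrep -ip dns intro-linux.pdf | sum_page_counts`.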
7. Stop match count
Use the --max-count or -m option to stop reading the file after NUM matching lines. This is useful when only the first few matches are needed and the rest of the file does not have to be read.
Example:
pdfgrep -inm 10 dns intro-linux.pdf
Output:
The output shows only the first 10 matches for the dns pattern; pdfgrep stopped reading the file after that.
8. Context control
The following options can be used when the user wants to know what lines are present before, after, and around the match.
8.1 Context after the match
Use the --after-context or -A option to print NUM lines of context after the match.
Example:
pdfgrep -A 2 dns intro-linux.pdf
Output:
Here we can see that 2 lines are printed after each match, and contiguous groups of matches are separated by --.
8.2 Context before the match
Use the --before-context or -B option to print NUM lines of context before the match.
Example:
pdfgrep -B 2 dns intro-linux.pdf
Output:
8.3 Context around match
Use the --context or -C option to print NUM lines of context before and after the match.
Example:
pdfgrep -C 2 dns intro-linux.pdf
Output:
9. Caching
A PDF file can contain images along with the text. When the file is large, it takes some time to skip the media and extract the text, which can be frustrating when the same file is searched frequently. The --cache option caches the rendered text so that later searches of the same file run faster. It is especially helpful when the file is large.
Example:
time pdfgrep --cache -iq dns intro-linux.pdf
Output:
Here the pattern dns is searched with and without the cache enabled; the command that includes --cache completes faster than the ones that do not. The -q option suppresses the normal output so that only the timing is shown.
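Because -q prints nothing, it is also handy in scripts that only need to know whether a pattern occurs. The sketch below assumes pdfgrep follows grep-like exit statuses (0 when a match is found, non-zero otherwise); the function name pdf_mentions is hypothetical.

```shell
# pdf_mentions PATTERN FILE
# Succeeds (exit status 0) when FILE contains PATTERN, ignoring case.
pdf_mentions() {
    [ -f "$2" ] || return 2      # treat a missing file as an error
    pdfgrep -iq "$1" "$2"
}

if pdf_mentions dns intro-linux.pdf; then
    echo "intro-linux.pdf mentions dns"
else
    echo "no match (or the file could not be read)"
fi
```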
10. Password protected file
With the --password option, pdfgrep can also search a password-protected file.
Usage:
pdfgrep --password [password] [pattern] [pdf_file]
Example:
pdfgrep --password "ndey" dns intro-linux-protected.pdf
Differences between grep and pdfgrep
grep | pdfgrep
It works only on plain text files. | It works only on PDF files.
It is a default package. | It is not a default package but can be downloaded from the repository.
It operates on lines. | It operates on pages.
The -n option shows the line number. | The -n option shows the page number.