Perl | Searching in a File using regex
Prerequisite: Perl | Regular Expressions
A Regular Expression (Regex, Regexp, or RE) in Perl is a special text string that describes a search pattern within a given text. Regular expressions are built into the Perl language itself, and their syntax is not identical to that of PHP, Python, etc. They are sometimes referred to as “Perl 5 Compatible Regular Expressions”. To apply a regex, binding operators are used:
=~ (Regex Operator) and
!~ (Negated Regex Operator).
These binding operators test a string against a regular expression. The left-hand side of the statement holds the string to be tested, and the right-hand side holds the pattern. The =~ operator is true when the pattern matches somewhere in the string; the negated operator !~ is true when the pattern does not match anywhere in the string.
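The two binding operators can be sketched as follows. The sample string and the variable names $has_the and $lacks_python are assumptions made only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string, used only for illustration.
my $String = "Perl makes searching the text easy";

# =~ is true when the pattern matches somewhere in the string.
my $has_the = ($String =~ /the/) ? 1 : 0;

# !~ is true when the pattern does not match anywhere in the string.
my $lacks_python = ($String !~ /Python/) ? 1 : 0;

print "contains 'the'\n"          if $has_the;
print "does not mention Python\n" if $lacks_python;
```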
Regex operators help in searching for a specific word or a group of words in a file. This can be done in several ways, depending on the user’s requirement. Searching in Perl follows a standard sequence: open the file in read mode, read it line by line, and look for the required string or group of strings in each line. When a match is found, the statement that follows the search expression decides what to do with the matching line: it can be written to another file specified by the user or simply printed on the console.
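The sequence described above might look like the following minimal sketch. It assumes hypothetical sample text and reads it through an in-memory filehandle so that it runs without an external file; for a real file, a path would be passed to open in place of \$text. The names $text, $in, and @found are illustrative.

```perl
use strict;
use warnings;

# Hypothetical file contents. An in-memory filehandle keeps the sketch
# self-contained; for a real file, pass a path instead of \$text.
my $text = "The sun rose over the hills.\n"
         . "Birds sang loudly.\n"
         . "Then the rain came.\n";

open(my $in, '<', \$text) or die "Cannot open input: $!";

my @found;                       # matching lines, kept for later use
while (my $String = <$in>) {     # read one line at a time
    chomp $String;               # drop the trailing newline
    if ($String =~ /the/) {      # does this line contain the pattern?
        push @found, $String;
        print "$.: $String\n";   # $. holds the current line number
    }
}
close($in);
```

To copy the matching lines to another file instead, the print statement could write to a second filehandle opened in write mode.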
Within the regular expression, the required string can be specified in several ways:
Regular Search:
This is the basic form of a regular expression, which looks for the required string exactly as written in each line of the specified file. Following is the syntax of such a Regular Expression:
$String =~ /the/
When $String holds a line read from the file, this expression is true for every line that contains the letters ‘the’ anywhere in it. The match does not change $String; a matching line can then be copied to a file or simply printed on the console.
This search also selects lines in which ‘the’ appears only as part of a longer word, such as ‘there’ or ‘other’. To avoid such words, the regular expression can be changed in the following manner:
$String =~ / the /
With a space before and after the required word, the word is isolated at both ends, so words that merely contain it are no longer matched. This solves the problem of matching extra words that are not required. However, it also excludes occurrences of the word that are immediately followed by a comma or full stop, as well as occurrences at the very start or end of a line.
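The difference between the two patterns can be seen in a short sketch. The sample lines and the array names @plain and @spaced are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen to show the difference.
my @lines = (
    "I saw the cat",               # 'the' as a separate word
    "They went to another town",   # 'the' only inside 'another'
    "Was it the, or a?",           # 'the' followed by a comma
);

# Lines matched by the plain pattern and by the space-delimited pattern.
my @plain  = grep { $_ =~ /the/ }   @lines;
my @spaced = grep { $_ =~ / the / } @lines;

print "Plain pattern matched ",  scalar(@plain),  " line(s)\n";
print "Spaced pattern matched ", scalar(@spaced), " line(s)\n";
```

In this sample, the plain pattern selects all three lines, while the spaced pattern selects only the first and misses the line where a comma follows the word.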
To avoid this situation, there are other ways of limiting the search to a specific word. One of them is the word boundary.
Using Word Boundary in Regex Search:
As described above, a regular search either returns extra words that contain the searched word as a part, or, when spaces are placed before and after the required word, excludes some valid occurrences. To avoid both problems, a word boundary is used, which is denoted by ‘\b’:
$String =~ /\bthe\b/;
This pattern matches ‘the’ only as a whole word. Words that contain it as a part are excluded, while occurrences that are followed by a comma or full stop are still included, because a word boundary lies between a word character and any non-word character.
Hence, the word boundary overcomes the problems created by the Regular Search method.
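A minimal sketch of the word boundary in use, assuming hypothetical sample lines; the array name @bounded is illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample lines.
my @lines = (
    "I saw the cat",               # whole word
    "They went to another town",   # only inside 'another'
    "That is the.",                # whole word followed by a full stop
);

# \b matches at the edge between a word character and a non-word character.
my @bounded = grep { $_ =~ /\bthe\b/ } @lines;

print "$_\n" for @bounded;
```

Here the second line is skipped because ‘the’ occurs only inside ‘another’, and the third line is kept even though a full stop follows the word.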
What if there is a need to find words that start with specific characters, end with specific characters, or both? That cannot be done with the Regular Search or the word boundary alone. For cases like these, Perl allows the use of Wild Cards in the Regular Expression.
Use of Wild Cards in Regular Expression:
Perl can search a file for a set of words that follow a common pattern by using Wild cards in the Regular Expression. The wild card is the dot (‘.’), which matches any single character except a newline. Dots are placed within the regex along with the fixed letters of the required word, so that a single pattern matches every string that fits it. This avoids running a separate search for each word that shares the same pattern of letters.
$String =~ /t..s/;
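The above pattern matches any sequence that starts with t, ends with s, and has exactly two characters between them. Because the dot matches any character, the two characters in between need not be letters, and the match may lie inside a longer word.
A search with this pattern selects every line that contains at least one such sequence.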
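The wild card pattern can be tried on a few hypothetical lines; the sample text and the array name @hits are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines.
my @lines = (
    "This is a test",            # no lowercase t followed by two chars and s
    "Count the toys and tags",   # 'toys' and 'tags' fit the pattern
    "it is late",                # 't is' fits: a dot also matches a space
);

# Keep every line in which t, any two characters, and s occur in sequence.
my @hits = grep { $_ =~ /t..s/ } @lines;

print "$_\n" for @hits;
```

The third line shows that the pattern is not restricted to whole words: the sequence ‘t is’ spans two words and still matches.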
When the result is printed in this way, the whole line that contains the match is printed, which makes it hard to see exactly which word was matched. To avoid this confusion, only the matched words can be printed instead of the whole line. This is done by grouping the searched pattern with parentheses and printing the groups through the $number variables.
The $number variables hold the text captured by the groups in the last successful match. For example, if the regular expression contains several groups, $1 holds the text matched by the first group, $2 holds the text matched by the second group, and so on.
Given below is the above program transformed using the $number variables to show only the searched words and not the whole sentence:
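A minimal sketch of such a program, assuming hypothetical sample text that is read through an in-memory filehandle in place of a real file. The names $text, $in, @words, $first, and $second are illustrative.

```perl
use strict;
use warnings;

# Hypothetical file contents, read through an in-memory filehandle.
my $text = "Count the toys and tags\n"
         . "This is a test\n"
         . "He owns two tins\n";

open(my $in, '<', \$text) or die "Cannot open input: $!";

my @words;
while (my $String = <$in>) {
    # The parentheses form a capture group; $1 holds what it matched.
    if ($String =~ /(t..s)/) {
        push @words, $1;
        print "$1\n";            # print only the matched text
    }
}
close($in);

# With two groups, $1 and $2 hold the first and second captures.
my ($first, $second) = ('', '');
if ("Count the toys" =~ /(t.e) (t..s)/) {
    ($first, $second) = ($1, $2);
}
print "$first / $second\n";
```

Only the first match in each line is captured here, so ‘tags’ in the first line is not reported.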