Perl | Special Character Classes in Regular Expressions

Perl implements many character classes, and some of them are used so frequently that special sequences exist for them. These special sequences make the code shorter and more readable. The special character classes in Perl are as follows:

  1. Digit \d ([0-9]): \d matches any single digit character; for ASCII text it is equivalent to [0-9] (without the /a modifier, \d also matches digits from other scripts). The regex /\d/ matches a single digit. The d stands for “digit”. The main advantage is that the pattern is shorter to write and easier to read. There are two ways to use this special character class: on its own, or inside a larger bracketed character class. Let’s start with an example of the first.

    Example:

    /#[MNOPQ]-\d\d\d\d\d/
    

    The pattern above matches strings such as the following:

    #M-12345
    #N-66666
    

    A quantifier can also be applied to the character class.

    Example:

    /#[MNOPQ]-\d{5}/
    

    Here \d{5} matches exactly five digits, the same as writing \d five times in a row. To allow one or more digits after the dash instead, write /#[MNOPQ]-\d+/.
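
The difference between \d{5} and \d+ can be sketched as follows. The sample labels are hypothetical, and the \A and \z anchors and the /a modifier (Perl 5.14 or later) are additions for this illustration, not part of the patterns above.

```perl
use strict;
use warnings;

# Hypothetical sample labels for illustration.
my @labels = ('#M-12345', '#N-66666', '#Z-12345', '#P-123');

# \A and \z anchor the match, so exactly five digits must follow the dash.
# The /a modifier restricts \d to the ASCII digits 0-9.
my @five_digits = grep { /\A#[MNOPQ]-\d{5}\z/a } @labels;

# \d+ accepts one or more digits after the dash.
my @any_digits = grep { /\A#[MNOPQ]-\d+\z/a } @labels;

print "exactly five digits: @five_digits\n";
print "one or more digits:  @any_digits\n";
```

The label '#Z-12345' is rejected by both patterns because Z is not in [MNOPQ], while '#P-123' passes only the \d+ version.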

    The second way is to use \d inside a larger bracketed character class. There, \d stands for any single digit alongside the other listed characters.

    Example:

    [\dABCDEFGHIJKLMN]
    

    This matches a single digit or any one of the capital letters A, B, C, D, E, F, G, H, I, J, K, L, M or N. It can be written in a shorter form by using a dash (-) to express the range:

    [\dA-N]
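
As a quick check that the long and the short form accept the same characters, the following sketch compares them on a few hypothetical test characters. The anchors \A and \z are added here so that each test string must be exactly one character.

```perl
use strict;
use warnings;

# The spelled-out class and its range shorthand.
my $long  = qr/\A[\dABCDEFGHIJKLMN]\z/;
my $short = qr/\A[\dA-N]\z/;

# Hypothetical test characters: some inside the class, some outside.
my @chars = ('7', 'A', 'N', 'O', 'a', '-');

my $disagreements = 0;
for my $ch (@chars) {
    my $l = $ch =~ $long  ? 1 : 0;
    my $s = $ch =~ $short ? 1 : 0;
    $disagreements++ if $l != $s;
    printf "%-3s %s\n", "'$ch'", $s ? 'matches' : 'does not match';
}
print "disagreements between the two forms: $disagreements\n";
```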
    

  2. POSIX character classes: POSIX is a family of standards for maintaining compatibility between operating systems; it defines the application programming interface (API) along with command-line shells and utility interfaces. It also specifies a number of named “groups of characters” (alpha, alnum, ascii, blank, etc.). A POSIX character class always has the form [:class:], where class is the name and [: and :] are the delimiters. POSIX character classes can appear only inside bracketed character classes. They are a convenient and self-explanatory way of listing a group of characters.

    Syntax:

    $string =~ /[[:class:]]/

    Here class can be alpha, alnum, ascii etc.

    POSIX character classes can be combined with other characters inside a larger bracketed character class, as shown below:

    [01[:alpha:]%]
    

    This matches ‘0’, ‘1’, any alphabetical character, and the percent sign. Perl supports the POSIX character classes shown in the table below:

    Class     Description
    alpha     Any alphabetical character (“[A-Za-z]”)
    alnum     Any alphanumeric character (“[A-Za-z0-9]”)
    ascii     Any character in the ASCII character set
    blank     A space or a horizontal tab
    cntrl     Any control character
    digit     Any decimal digit (“[0-9]”)
    graph     Any printable character, excluding a space
    lower     Any lowercase character (“[a-z]”)
    punct     Any graphical character, excluding “word” characters
    space     Any whitespace character
    upper     Any uppercase character (“[A-Z]”)
    xdigit    Any hexadecimal digit (“[0-9a-fA-F]”)
    word      A Perl extension (“[A-Za-z0-9_]”), equivalent to “\w”

  3. Word character \w ([0-9a-zA-Z_]): \w is the word character class. It matches any single alphanumeric character, that is, an alphabetic character or a decimal digit, plus the underscore (_). It matches only one character, not a whole word; to match a whole word, use \w+.
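
The difference between \w and \w+ can be seen in a short sketch like this one, where the sample string is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample text for illustration.
my $text = 'user_42 logged in';

# \w captures a single word character.
my ($first_char) = $text =~ /(\w)/;

# \w+ captures a whole run of word characters; the underscore and the
# digits count as word characters, so "user_42" is one word.
my ($first_word) = $text =~ /(\w+)/;

# With /g in list context, every word in the string is returned.
my @words = $text =~ /\w+/g;

print "first character: $first_char\n";
print "first word:      $first_word\n";
print "all words:       @words\n";
```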

  4. Whitespace \s ([\t\n\f\r ]): The character class \s matches a single whitespace character. It matches these five characters: \t (horizontal tab), \n (newline), \f (form feed), \r (carriage return), and the space. Starting with Perl v5.18, \s also matches the vertical tab (\cK).
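
One common use of \s is splitting text on runs of whitespace. The sketch below assumes a hypothetical input string that mixes tabs, spaces, and line endings; the result of the vertical tab check depends on the Perl version in use.

```perl
use strict;
use warnings;

# Hypothetical input mixing a tab, a newline, a space, and a CR/LF pair.
my $raw = "alpha\tbeta\n gamma\r\ndelta";

# \s+ treats any run of whitespace characters as one separator.
my @fields = split /\s+/, $raw;

# Count the individual whitespace characters matched by \s.
my $count = () = $raw =~ /\s/g;

print "fields: @fields\n";
print "whitespace characters: $count\n";

# Reports whether this Perl treats the vertical tab as whitespace.
printf "vertical tab matched by \\s: %s\n", "\cK" =~ /\s/ ? 'yes' : 'no';
```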

  5. Negated character classes \D, \W, \S: Unicode defines more than 110,000 characters, so listing every character that is not, say, a digit is impractical. To negate a bracketed character class, place a caret (^) immediately after the opening bracket; the class then matches any character except the listed characters or ranges. For example, [^\d] matches any character that is not a digit. Instead of [^\d] we can simply write \D. The following table lists the special negated character classes:
    Character Class    Negated    Meaning    Description
    \d                 \D         [^\d]      matches any non-digit character
    \s                 \S         [^\s]      matches any non-whitespace character
    \w                 \W         [^\w]      matches any non-“word” character
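
The negated classes are handy for stripping or splitting text. The following is a small sketch with hypothetical inputs; it also checks that \D and [^\d] give the same result.

```perl
use strict;
use warnings;

# Hypothetical inputs for illustration.
my $phone = '(555) 123-4567';

# \D removes everything that is not a digit.
(my $digits_only = $phone) =~ s/\D//g;

# [^\d] is the longhand form and behaves the same way.
(my $same_result = $phone) =~ s/[^\d]//g;

# \S+ picks out each run of non-whitespace characters.
my @tokens = "  a b\tc \n" =~ /\S+/g;

# \W+ replaces each run of non-word characters with an underscore.
(my $slug = 'Hello, World!') =~ s/\W+/_/g;

print "digits only: $digits_only\n";
print "tokens:      @tokens\n";
print "slug:        $slug\n";
```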

  6. Unicode character classes: Unicode aims to define all existing characters; the Unicode Standard assigns a unique number to every character, independent of platform. There are more than 100,000 characters, and each one is identified by its code point. Some characters are also grouped together by shared properties, such as being a letter or belonging to a particular script.

    Syntax:

    \p{...property name...}
    

    This syntax matches a single character that belongs to the named group (Unicode property). To match any character that is not in the group, use the corresponding \P{...property name...} expression.
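
As an illustration, the sketch below counts characters by Unicode property. The sample string and the chosen properties (L for letters, Lu for uppercase letters, Greek for the Greek script) are assumptions made for this example; non-ASCII characters are written as \x{...} escapes so the source file stays plain ASCII.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample: "cafe" with e-acute (U+00E9), then a word that
# starts with GREEK CAPITAL LETTER OMEGA (U+03A9), then two digits.
my $text = "caf\x{e9} \x{3a9}mega 42";

# The "() =" idiom forces list context so the matches can be counted.
my $letters     = () = $text =~ /\p{L}/g;      # any letter
my $uppercase   = () = $text =~ /\p{Lu}/g;     # uppercase letters
my $greek       = () = $text =~ /\p{Greek}/g;  # Greek script
my $non_letters = () = $text =~ /\P{L}/g;      # anything but a letter

print "text:        $text\n";
print "letters:     $letters\n";
print "uppercase:   $uppercase\n";
print "greek:       $greek\n";
print "non-letters: $non_letters\n";
```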




Improved By : shubham_singh


