Python RegEx

In this tutorial, you’ll learn about RegEx and understand various regular expressions.

  • Regular Expressions
  • Why Regular Expressions
  • Basic Regular Expressions
  • More Regular Expressions
  • Compiled Regular Expressions

A regular expression (RegEx) is a powerful tool for matching text against a pre-defined pattern. It can detect the presence or absence of a piece of text by matching it against the pattern, and it can also split a pattern into one or more sub-patterns. The Python standard library provides the re module for regular expressions. Its primary function is search, which takes a regular expression and a string and returns either the first match or None.

Python3




import re
  
  
match = re.search(r'portal', 'GeeksforGeeks: A computer science \
                  portal for geeks')
print(match)
print(match.group())
  
print('Start Index:', match.start())
print('End Index:', match.end())


Output

<_sre.SRE_Match object; span=(52, 58), match='portal'>
portal
Start Index: 52
End Index: 58

Here the r prefix (r'portal') stands for raw, not RegEx. A raw string differs slightly from a regular string: Python does not interpret the \ character in it as an escape character. This matters because the regular expression engine uses the \ character for its own escaping.
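
To make the difference concrete, here is a minimal sketch comparing a regular string with a raw string. The variable names and the sample text are only illustrative.

```python
import re

# In a regular string, Python turns \b into a backspace character
# before the regex engine ever sees it; a raw string keeps the backslash.
plain = '\bgeeks\b'
raw = r'\bgeeks\b'

print(len(plain))  # 7: backspace + 'geeks' + backspace
print(len(raw))    # 9: backslash, 'b', 'geeks', backslash, 'b'

text = 'a portal for geeks'
plain_match = re.search(plain, text)
raw_match = re.search(raw, text)

print(plain_match)        # None: the text contains no backspace characters
print(raw_match.group())  # geeks
```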

Before starting with the Python regex module let’s see how to actually write RegEx using metacharacters or special sequences. 

MetaCharacters

Metacharacters are characters that carry a special meaning inside a regular expression, and they are used throughout the functions of the re module. Below is the list of metacharacters.

MetaCharacters  Description
\               Drops the special meaning of the character following it
[]              Represents a character class
^               Matches the beginning of the string
$               Matches the end of the string
.               Matches any character except newline
|               Means OR (matches any of the patterns separated by it)
?               Matches zero or one occurrence
*               Matches any number of occurrences (including 0 occurrences)
+               Matches one or more occurrences
{}              Indicates the number of occurrences of the preceding RegEx to match
()              Encloses a group of RegEx
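
A few of these metacharacters can be tried out together. The following is only a sketch; the sample strings and variable names are made up for illustration.

```python
import re

# | separates alternatives.
pets = re.findall(r'cat|dog', 'a cat, a dog, a cow')
print(pets)  # ['cat', 'dog']

# \ drops the special meaning of $ and . while {1,3} and {2} count digits.
prices = re.findall(r'\$\d{1,3}\.\d{2}', 'Tea $4.50, lunch $12.00, id 4x50')
print(prices)  # ['$4.50', '$12.00']

# ^ and $ together force the pattern to span the entire string.
whole = re.search(r'^\d+$', '2020') is not None
partial = re.search(r'^\d+$', 'year 2020') is not None
print(whole, partial)  # True False
```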

The group() method returns the matched string, and the start() and end() methods return the starting and ending index of the match. The match object has many other methods, which we will discuss later.
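
Because search returns None when nothing matches, calling group() on the result directly can raise an AttributeError. A small sketch of a safer pattern follows; find_word is a hypothetical helper written for this example, not part of the re module.

```python
import re

def find_word(word, text):
    """Return (matched text, start, end) or None when the word is absent."""
    # re.escape makes sure the word is treated literally.
    match = re.search(re.escape(word), text)
    if match is None:
        return None
    return match.group(), match.start(), match.end()

found = find_word('portal', 'A computer science portal for geeks')
missing = find_word('python', 'A computer science portal for geeks')
print(found)    # ('portal', 19, 25)
print(missing)  # None
```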

Why RegEx?

Let’s take a moment to understand why we should use regular expressions.

  1. Data Mining: Regular expressions are well suited to data mining. They efficiently locate a piece of text in a large body of text by checking it against a pre-defined pattern. Some common scenarios are identifying an email, URL, or phone number in a pile of text.
  2. Data Validation: Regular expressions can validate the format of data. A wide array of validation rules can be expressed by defining different sets of patterns. A few examples are validating phone numbers, emails, etc.
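
Both use cases can be sketched in a few lines. The patterns below are deliberately simplified assumptions (real email and phone formats vary far more), and EMAIL, PHONE, and is_valid_phone are names invented for this example.

```python
import re

# Simplified, illustrative patterns; not complete validators.
EMAIL = r'[\w.+-]+@[\w-]+\.[\w.]+'
PHONE = r'\d{3}-\d{3}-\d{4}'

text = 'Contact admin@example.com or sales@example.org for details.'

# Data mining: pull every email-like token out of free text.
emails = re.findall(EMAIL, text)
print(emails)  # ['admin@example.com', 'sales@example.org']

# Data validation: fullmatch succeeds only if the whole string fits.
def is_valid_phone(value):
    return re.fullmatch(PHONE, value) is not None

print(is_valid_phone('555-010-9999'))       # True
print(is_valid_phone('call 555-010-9999'))  # False
```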

Basic RegEx

Let’s understand some of the basic regular expressions. They are as follows:

  • Character Classes
  • Ranges
  • Negation
  • Shortcuts
  • Beginning and End of String
  • Any Character

Character Classes

A character class matches a single character against a set of possible characters. You specify a character class within square brackets. Let’s consider an example that matches a word whether or not its first letter is capitalized.

Python3




import re
  
  
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: \
                 A computer science portal for geeks'))


Output

['Geeks', 'Geeks', 'geeks']

Ranges

A range lets you match any character within an interval, such as a range of digits (0 to 9), a range of letters (A to Z), and so on. The hyphen character within the character class represents a range.

Python3




import re
  
  
print('Range',re.search(r'[a-zA-Z]', 'x'))


Output

Range <_sre.SRE_Match object; span=(0, 1), match='x'>
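
Several ranges can be combined in one class, and a hyphen placed at the end of the class is taken literally. A short sketch with made-up sample strings:

```python
import re

# Three ranges in one class: digits plus lower- and upper-case a to f.
hex_words = re.findall(r'[0-9a-fA-F]+', 'ff 7B zz 10')
print(hex_words)  # ['ff', '7B', '10']

# A trailing hyphen is a literal hyphen, not a range.
hyphenated = re.findall(r'[a-z-]+', 'well-known fact')
print(hyphenated)  # ['well-known', 'fact']
```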

Negation

Negation inverts a character class. A ^ placed right after the opening bracket makes the class match any character except the characters or ranges listed in it.

Python3




import re
  
print(re.search(r'[^a-z]', 'c'))


Output

None

In the above case, we have inverted the character class that ranges from a to z. If we try to match a character within the mentioned range, the regular expression engine returns None.

Let’s consider another example.

Python3




import re
  
print(re.search(r'G[^e]', 'Geeks'))


Output

None

Here the pattern accepts any character after G other than e. Since G is followed by e in 'Geeks', there is no match.
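
The position of ^ inside the brackets matters. The sketch below, using made-up inputs, contrasts a negated class with a class that merely contains a literal ^, and shows the G[^e] pattern succeeding on a different word.

```python
import re

# ^ directly after [ negates the class.
non_digits = re.findall(r'[^0-9]', 'a1b2')
print(non_digits)  # ['a', 'b']

# Anywhere else inside the brackets, ^ is an ordinary character.
digits_or_caret = re.findall(r'[0-9^]', 'a1^2')
print(digits_or_caret)  # ['1', '^', '2']

# G followed by anything other than e.
goal = re.search(r'G[^e]', 'Goal')
print(goal.group())  # Go
```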

List of special sequences 

Special Sequence  Description  Examples
\A  Matches only at the start of the string.  Example: \Afor matches in "for geeks" and "for the world".
\b  Matches at a word boundary: \b(string) checks for the beginning of a word and (string)\b checks for the end of a word.  Example: \bge matches in "geeks" and "get".
\B  The opposite of \b: matches only at positions that are not word boundaries.  Example: \Bge matches in "together" and "forge".
\d  Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].  Example: \d matches in "123" and "gee1".
\D  Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].  Example: \D matches in "geeks" and "geek1".
\s  Matches any whitespace character.  Example: \s matches in "gee ks" and "a bc a".
\S  Matches any non-whitespace character.  Example: \S matches in "a bd" and "abcd".
\w  Matches any word character (alphanumeric or underscore); for ASCII text this is equivalent to the class [a-zA-Z0-9_].  Example: \w matches in "123" and "geeKs4".
\W  Matches any non-word character.  Example: \W matches in ">$" and "gee<>".
\Z  Matches only at the end of the string.  Example: ab\Z matches in "abcdab" and "abababab".
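
Several of these sequences can be checked against one sample string. This is a sketch only; the sample text and variable names are chosen for illustration.

```python
import re

text = 'geeks forge 42 get together'

# \b: "ge" must sit at the start of a word.
starts_with_ge = re.findall(r'\bge\w*', text)
print(starts_with_ge)  # ['geeks', 'get']

# \B: "ge" must NOT sit at the start of a word.
ge_inside = re.findall(r'\w*\Bge\w*', text)
print(ge_inside)  # ['forge', 'together']

# \d: runs of digits.
numbers = re.findall(r'\d+', text)
print(numbers)  # ['42']

# \A and \Z anchor to the very start and very end of the string.
at_start = re.search(r'\Ageeks', text) is not None
not_at_start = re.search(r'\Aforge', text) is not None
at_end = re.search(r'together\Z', text) is not None
print(at_start, not_at_start, at_end)  # True False True
```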

Shortcuts

Let’s discuss some of the shortcuts provided by the regular expression engine.

  • \w – matches a word character
  • \d – matches digit character
  • \s – matches whitespace character (space, tab, newline, etc.)
  • \b – matches a zero-length word boundary (the position where a word starts or ends)

Python3




import re
  
  
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))
print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks'))


Output

Geeks: <_sre.SRE_Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: None

Beginning and End of String

The ^ character matches at the beginning of a string and the $ character matches at the end of a string.

Python3




import re
  
  
# Beginning of String
match = re.search(r'^Geek', 'Campus Geek of the month')
print('Beg. of String:', match)
  
match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)
  
# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)


Output

Beg. of String: None
Beg. of String: <_sre.SRE_Match object; span=(0, 4), match='Geek'>
End of String: <_sre.SRE_Match object; span=(31, 36), match='Geeks'>

Any Character

Outside a bracketed character class, the . character matches any single character except a newline.

Python3




import re
  
print('Any Character', re.search(r'p.th.n', 'python 3'))


Output

Any Character <_sre.SRE_Match object; span=(0, 6), match='python'>
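
The two caveats, the newline exception and the literal meaning inside brackets, can be verified with a quick sketch (sample strings are illustrative):

```python
import re

dot_letter = re.search(r'a.c', 'abc')     # . matches 'b'
dot_newline = re.search(r'a.c', 'a\nc')   # . does not match a newline
class_letter = re.search(r'a[.]c', 'abc') # inside [], . is a literal dot
class_dot = re.search(r'a[.]c', 'a.c')

print(dot_letter.group())  # abc
print(dot_newline)         # None
print(class_letter)        # None
print(class_dot.group())   # a.c
```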

More RegEx

Some of the other regular expressions are as follows:

  • Optional Characters
  • Repetition
  • Shorthand
  • Grouping
  • Lookahead
  • Substitution

Optional Characters

The regular expression engine lets you mark a character as optional using the ? character. The preceding character or character class may then appear once or not at all. Let’s consider the example of a word with an alternative spelling: color or colour.

Python3




import re
  
  
print('Color',re.search(r'colou?r', 'color')) 
print('Colour',re.search(r'colou?r', 'colour'))


Output

Color <_sre.SRE_Match object; span=(0, 5), match='color'>
Colour <_sre.SRE_Match object; span=(0, 6), match='colour'>
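
The same idea applies to other optional parts, such as the s in https or a leading minus sign. A brief sketch with invented sample data; is_integer is a hypothetical helper:

```python
import re

# The s is optional, so both schemes are found.
urls = re.findall(r'https?://\S+',
                  'See http://example.com and https://example.org today')
print(urls)  # ['http://example.com', 'https://example.org']

# An optional minus sign in front of a run of digits.
def is_integer(value):
    return re.fullmatch(r'-?\d+', value) is not None

print(is_integer('-42'), is_integer('42'), is_integer('--42'))  # True True False
```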

Repetition

Repetition enables you to repeat the same character or character class a given number of times. Consider an example of a date that consists of day, month, and year. Let’s use a regular expression to identify the date (dd-mm-yyyy).

Python3




import re
  
  
print('Date{dd-mm-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}',
                                     '18-08-2020'))


Output

Date{dd-mm-yyyy}: <_sre.SRE_Match object; span=(0, 10), match='18-08-2020'>

Here, the regular expression engine checks for two consecutive digits. Upon finding them, it moves on to the hyphen character. After that, it checks for the next two consecutive digits, and the process is repeated.
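
Note that {n} only counts characters; it does not check that the digits form a real date, and without boundaries it can match inside a longer run of digits. A sketch with an invented log line:

```python
import re

log = 'Opened 18-08-2020, updated 2-9-2020, closed 26-08-2020.'

# Only the entries with exactly 2-2-4 digits are picked up.
dates = re.findall(r'\d{2}-\d{2}-\d{4}', log)
print(dates)  # ['18-08-2020', '26-08-2020']

# Without \b the pattern matches inside a longer digit string.
loose = re.findall(r'\d{2}-\d{2}-\d{4}', '118-08-20201')
strict = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', '118-08-20201')
print(loose)   # ['18-08-2020']
print(strict)  # []
```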

Let’s discuss three other regular expressions under repetition.

Repetition ranges

A repetition range is useful when you have to accept more than one length. Consider a scenario where both three-digit and four-digit numbers are accepted. Let’s have a look at the regular expression.

Python3




import re
  
  
print('Three Digit:', re.search(r'[\d]{3,4}', '189'))
print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))


Output

Three Digit: <_sre.SRE_Match object; span=(0, 3), match='189'>
Four Digit: <_sre.SRE_Match object; span=(0, 4), match='2145'>

Open-Ended Ranges

There are scenarios where there is no limit on how often a character may repeat. In such scenarios, you can leave the upper limit open by omitting it, as in {1,}. A common example is matching street addresses. Let’s have a look.

Python3




import re
  
  
print(re.search(r'[\d]{1,}','5th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))


Output

<_sre.SRE_Match object; span=(0, 1), match='5'>
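
Since search stops at the first match, only '5' is reported above. As a sketch, findall with the same open-ended range collects every number in the address, and raising the lower bound filters out the short ones:

```python
import re

address = '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'

first = re.search(r'\d{1,}', address).group()
all_numbers = re.findall(r'\d{1,}', address)
three_or_more = re.findall(r'\d{3,}', address)

print(first)          # 5
print(all_numbers)    # ['5', '118', '136', '201305']
print(three_or_more)  # ['118', '136', '201305']
```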

Shorthand

Shorthand characters allow you to use the + character to specify one or more ({1,}) and the * character to specify zero or more ({0,}).

Python3




import re
  
print(re.search(r'[\d]+', '5th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))


Output

<_sre.SRE_Match object; span=(0, 1), match='5'>
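
The difference between + and * shows up when the repeated item is absent. A minimal sketch with made-up inputs:

```python
import re

# + and {1,} are interchangeable.
plus = re.findall(r'\d+', 'a1b22c')
braces = re.findall(r'\d{1,}', 'a1b22c')
print(plus, braces)  # ['1', '22'] ['1', '22']

# * accepts zero occurrences, so 'ac' also matches.
star = re.findall(r'ab*c', 'ac abc abbc')
at_least_one = re.findall(r'ab+c', 'ac abc abbc')
print(star)          # ['ac', 'abc', 'abbc']
print(at_least_one)  # ['abc', 'abbc']

# With nothing but *, an empty match at position 0 is a success.
empty = re.search(r'\d*', 'abc')
print(empty.span())  # (0, 0)
```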

Grouping

Grouping is the process of separating an expression into groups by using parentheses, and it allows you to fetch each individual matching group.  

Python3




import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020')
print(grp)


Output

<_sre.SRE_Match object; span=(0, 10), match='26-08-2020'>

Let’s see some of its functionality.

Return the entire match

The re module allows you to return the entire match using the group() method.

Python3




import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group())


Output

26-08-2020

Return a tuple of matched groups

You can use the groups() method to return a tuple that holds the individual matched groups.

Python3




import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.groups())


Output

('26', '08', '2020')

Retrieve a single group

By passing an index to the group() method, you can retrieve just a single group.

Python3




import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group(3))


Output

2020

Name your groups

The re module allows you to name your groups. Let’s look into the syntax.

Python3




import re
  
  
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
                  '26-08-2020')
print(match.group('mm'))


Output

08

Individual match as a dictionary

We have seen how a match provides a tuple of individual groups. With the groupdict() method, it can also provide the individual matches as a dictionary in which the name of each group acts as the dictionary key.

Python3




import re
  
  
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
                  '26-08-2020')
print(match.groupdict())


Output

{'dd': '26', 'mm': '08', 'yyyy': '2020'}
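
One possible use of groupdict() is turning the named pieces into a proper date object. The sketch below assumes the dd-mm-yyyy layout used above; parse_date is a hypothetical helper, and the standard datetime module does the calendar check.

```python
import re
from datetime import date

def parse_date(text):
    """Return a date for a dd-mm-yyyy string, or None if the shape is wrong."""
    match = re.fullmatch(r'(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})', text)
    if match is None:
        return None
    # groupdict() maps each group name to its matched text.
    parts = {name: int(value) for name, value in match.groupdict().items()}
    # date() raises ValueError for impossible dates such as 99-99-2020.
    return date(parts['yyyy'], parts['mm'], parts['dd'])

parsed = parse_date('26-08-2020')
rejected = parse_date('2020-08-26')
print(parsed)    # 2020-08-26
print(rejected)  # None
```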

Lookahead

A negated character class must still consume a character, so it cannot match when no character is left to check against the negation. We can overcome this case by using lookahead; it accepts or rejects a match based on the presence or absence of the content that follows, without consuming that content.

Python3




import re
  
  
print('negation:', re.search(r'n[^e]', 'Python'))
print('lookahead:', re.search(r'n(?!e)', 'Python'))


Output

negation: None
lookahead: <_sre.SRE_Match object; span=(5, 6), match='n'>

Lookahead can also reject a match that is not followed by a particular character. This is called a positive lookahead, and it can be achieved by simply replacing the ! character with the = character.

Python3




import re
  
print('positive lookahead', re.search(r'n(?=e)', 'jasmine'))


Output

positive lookahead <_sre.SRE_Match object; span=(5, 6), match='n'>
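
Because a lookahead does not consume what it checks, the checked text stays out of the result. A sketch with an invented price list; the \b anchors keep the engine from backtracking to a partial number such as the 4 in 40.

```python
import re

prices = '40 USD, 15 EUR, 7 USD'

# Positive lookahead: numbers followed by " USD"; the unit is not returned.
usd = re.findall(r'\d+(?= USD)', prices)
print(usd)  # ['40', '7']

# Negative lookahead: whole numbers NOT followed by " USD".
not_usd = re.findall(r'\b\d+\b(?! USD)', prices)
print(not_usd)  # ['15']
```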

Substitution

The re.sub method replaces the text matched by a regular expression and returns the resulting string. It is useful when you want to strip characters such as /, -, ., etc. from a value before storing it in a database. It takes three arguments:

  • the regular expression
  • the replacement string
  • the source string being searched

Let’s have a look at the code below, which removes the - character from a credit card number.

Python3




import re
  
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4',
             '1111-2222-3333-4444'))


Output

1111222233334444
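
A few variations on re.sub, sketched with the same sample card number: a simpler pattern for the same result, backreferences used to reorder groups, the optional count argument, and a masking replacement.

```python
import re

card = '1111-2222-3333-4444'

# Removing every hyphen does not need groups at all.
digits_only = re.sub(r'-', '', card)
print(digits_only)  # 1111222233334444

# Backreferences can reorder groups: dd-mm-yyyy becomes yyyy-mm-dd.
iso = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\2-\1', '26-08-2020')
print(iso)  # 2020-08-26

# count limits how many matches are replaced.
first_only = re.sub(r'-', ' ', card, count=1)
print(first_only)  # 1111 2222-3333-4444

# Mask every block of four digits that is followed by a hyphen.
masked = re.sub(r'\d{4}-', '****-', card)
print(masked)  # ****-****-****-4444
```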

Compiled RegEx

The Python regular expression engine can return a compiled regular expression (RegEx) object using the compile function. This object has its own search and sub methods, so a developer can reuse it whenever needed.

Python3




import re
  
regex = re.compile(r'([\d]{2})-([\d]{2})-([\d]{4})')
  
# search method
print('compiled reg expr', regex.search('26-08-2020'))
  
# sub method
print(regex.sub(r'\1.\2.\3', '26-08-2020'))


Output

compiled reg expr <_sre.SRE_Match object; span=(0, 10), match='26-08-2020'>
26.08.2020
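
Reuse is where a compiled object pays off, for example when the same pattern is applied to many values. A sketch with invented data; DATE and rows are illustrative names.

```python
import re

# Compile once, then reuse the object for every row.
DATE = re.compile(r'(\d{2})-(\d{2})-(\d{4})')

rows = ['26-08-2020', 'n/a', '01-01-2021']

# sub leaves non-matching rows untouched.
converted = [DATE.sub(r'\3-\2-\1', row) for row in rows]
print(converted)  # ['2020-08-26', 'n/a', '2021-01-01']

# fullmatch keeps only the rows that are entirely a date.
valid = [row for row in rows if DATE.fullmatch(row)]
print(valid)  # ['26-08-2020', '01-01-2021']
```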

Summary

RegEx is a powerful tool for data mining and data validation. However, avoid regular expressions when a more straightforward solution exists, such as a plain string method. Likewise, when you have to deal with complex structures such as a non-trivial document format, prefer a library designed for that format.



Last Updated : 19 Jul, 2022