Regular Expressions in Python

In Python, it is easy to check whether one string occurs within another. In some scenarios, however, we do not have the exact text to match. For example, what if you want to check whether any valid email address is present? This is where regular expressions come in. In this section, we explore regular expressions and the most common patterns used to build them.

  • Regular Expressions
  • Why Regular Expressions
  • Basic Regular Expressions
  • More Regular Expressions
  • Compiled Regular Expressions

Regular Expressions

A regular expression is a powerful tool for matching text against a pre-defined pattern. It can detect the presence or absence of text by matching it against a particular pattern, and it can also split a pattern into one or more sub-patterns. The Python standard library provides the re module for regular expressions. Its primary function is search, which takes a regular expression and a string and returns either the first match or None.

Python3

import re
  
  
match = re.search(r'portal', 'GeeksforGeeks: A computer science \
                  portal for geeks')
print(match)
print(match.group())
  
print('Start Index:', match.start())
print('End Index:', match.end())

Output

<_sre.SRE_Match object; span=(52, 58), match='portal'>
portal
Start Index: 52
End Index: 58

Here the r prefix (r'portal') stands for raw, not regex. A raw string differs slightly from a regular string: Python does not interpret the \ character in it as an escape character. This matters because the regular expression engine uses the \ character for its own escaping.

The group method returns the matching string, and the start and end methods return the starting and ending indices of the match. The match object has several other methods, which we will discuss later.
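Because search returns None when nothing matches, calling group() directly on the result raises an AttributeError in that case. A minimal sketch of the usual guard follows; the helper name first_match is hypothetical and exists only for this illustration.

```python
import re


def first_match(pattern, text):
    """Return the first matching substring, or None when nothing matches."""
    match = re.search(pattern, text)
    # re.search returns None on failure, so check before calling group()
    if match is None:
        return None
    return match.group()


print(first_match(r'portal', 'A computer science portal for geeks'))
print(first_match(r'forum', 'A computer science portal for geeks'))
```

The first call prints the matched text and the second prints None instead of raising an error.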



Why Regular Expressions?

Let’s take a moment to understand why we should use regular expressions.

  1. Data Mining: Regular expressions are well suited to data mining. They identify text in a heap of text by checking it against a pre-defined pattern. Some common scenarios are identifying an email address, URL, or phone number in a pile of text.
  2. Data Validation: Regular expressions can validate data. They can cover a wide array of validation rules by defining different sets of patterns. A few examples are validating phone numbers, email addresses, etc.
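The two use cases can be sketched as follows. Both patterns are deliberately simplified assumptions made for this illustration (real email and phone formats are more varied), the helper name is_valid_phone is hypothetical, and the constructs used are explained in the sections that follow.

```python
import re

# Data mining: pull every email-like token out of free text.
# The pattern is a simplified approximation, not a full email grammar.
text = 'Contact alice@example.com or bob@example.org for details'
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
print(emails)


# Data validation: accept only strings that are entirely in the
# assumed NNN-NNN-NNNN phone format.
def is_valid_phone(value):
    return re.fullmatch(r'\d{3}-\d{3}-\d{4}', value) is not None


print(is_valid_phone('555-123-4567'))
print(is_valid_phone('555-1234'))
```

Here findall collects all matches in the text, while fullmatch succeeds only when the whole string fits the pattern, which is usually what validation needs.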

Basic Regular Expressions

Let’s understand some of the basic regular expressions. They are as follows:

  • Character Classes
  • Ranges
  • Negation
  • Shortcuts
  • Beginning and End of String
  • Any Character

Character Classes

Character classes allow you to match a single character against a set of possible characters. You specify a character class within square brackets. Let’s consider an example that matches a word regardless of the case of its first letter.

Python3

import re
  
  
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: \
                 A computer science portal for geeks'))

Output

['Geeks', 'Geeks', 'geeks']

Ranges

A range provides the flexibility to match text with the help of a range pattern such as a range of numbers (0 to 9), a range of characters (A to Z), and so on. The hyphen character within the character class represents a range.

Python3

import re
  
  
print('Range',re.search(r'[a-zA-Z]', 'x'))

Output

Range <_sre.SRE_Match object; span=(0, 1), match='x'>

Negation

Negation inverts a character class. A ^ placed as the first character inside the brackets makes the class match any single character except the characters or ranges listed in it.

Python3

import re
  
print(re.search(r'[^a-z]', 'c'))

Output



None

In the above case, we have inverted the character class that ranges from a to z. If we try to match a character within the mentioned range, the regular expression engine returns None.

Let’s consider another example.

Python3

import re
  
print(re.search(r'G[^e]', 'Geeks'))

Output

None

Here the pattern accepts any character after G other than e. Since G is followed by e in Geeks, there is no match and the result is None.
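For contrast, a short sketch of the same pattern against an input where G is followed by a different letter. The sample word Google is an assumption chosen for illustration.

```python
import re

# 'G' is followed by 'o', which the negated class [^e] accepts
accepted = re.search(r'G[^e]', 'Google')
print(accepted.group())

# 'G' is followed by 'e', which the negated class rejects
rejected = re.search(r'G[^e]', 'Geeks')
print(rejected)
```

Note that the negated class consumes the character it matches, so the first match is two characters long.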

Shortcuts

Let’s discuss some of the shortcuts provided by the regular expression engine.

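  • \w – matches a word character (letter, digit, or underscore)
  • \d – matches a digit character
  • \s – matches a whitespace character (space, tab, newline, etc.)
  • \b – matches a word boundary, a zero-length position between a word character and a non-word character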

Python3

import re
  
  
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))
print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks'))

Output

Geeks: <_sre.SRE_Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: None
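The remaining shortcuts can be sketched in the same way. The sample text below is an assumption made for illustration.

```python
import re

text = 'Room 42, Floor 3'

digits = re.findall(r'\d', text)   # every single digit
words = re.findall(r'\w+', text)   # runs of word characters
spaces = re.findall(r'\s', text)   # every whitespace character

print(digits)
print(words)
print(len(spaces))
```

The comma is neither a word character nor whitespace, so it appears in none of the three results.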

Beginning and End of String

The ^ character anchors a match to the beginning of a string and the $ character anchors it to the end of a string.

Python3

import re
  
  
# Beginning of String
match = re.search(r'^Geek', 'Campus Geek of the month')
print('Beg. of String:', match)
  
match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)
  
# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)

Output

Beg. of String: None
Beg. of String: <_sre.SRE_Match object; span=(0, 4), match='Geek'>
End of String: <_sre.SRE_Match object; span=(31, 36), match='Geeks'>
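The two anchors are often combined so that the pattern has to cover the whole string. A minimal sketch, with input strings chosen for illustration:

```python
import re

# With both anchors, nothing may precede or follow the pattern
# (apart from a single trailing newline, which $ tolerates).
exact = re.search(r'^Geeks$', 'Geeks')
partial = re.search(r'^Geeks$', 'GeeksforGeeks')

print(exact.group())
print(partial)
```

The second search fails because the text after the first Geeks prevents $ from matching.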

Any Character

Outside a bracketed character class, the . character matches any single character except a newline.

Python3

import re
  
print('Any Character', re.search(r'p.th.n', 'python 3'))

Output



Any Character <_sre.SRE_Match object; span=(0, 6), match='python'>
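Two details of the . character are worth a quick sketch: inside square brackets it is a literal dot, and by default it does not match a newline unless the re.DOTALL flag is passed. The sample strings are assumptions made for illustration.

```python
import re

# Outside a class, '.' matches any character; inside one, it is a literal dot
any_char = re.search(r'3.8', '3x8')
literal_dot = re.search(r'3[.]8', '3x8')
print(any_char.group(), literal_dot)

# By default '.' skips newlines; re.DOTALL lifts that restriction
no_newline = re.search(r'p.thon', 'p\nthon')
with_dotall = re.search(r'p.thon', 'p\nthon', re.DOTALL)
print(no_newline, with_dotall is not None)
```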

More Regular Expressions

Some of the other regular expressions are as follows:

  • Optional Characters
  • Repetition
  • Shorthand
  • Grouping
  • Lookahead
  • Substitution

Optional Characters

The regular expression engine allows you to specify optional characters using the ? character. It allows a character or character class to occur either once or not at all. Let’s consider the example of a word with an alternative spelling: color or colour.

Python3

import re
  
  
print('Color',re.search(r'colou?r', 'color')) 
print('Colour',re.search(r'colou?r', 'colour'))

Output

Color <_sre.SRE_Match object; span=(0, 5), match='color'>
Colour <_sre.SRE_Match object; span=(0, 6), match='colour'>

Repetition

Repetition enables you to repeat the same character or character class. Consider an example of a date that consists of day, month, and year. Let’s use a regular expression to identify the date (dd-mm-yyyy).

Python3

import re
  
  
print('Date{dd-mm-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}',
                                     '18-08-2020'))

Output

Date{dd-mm-yyyy}: <_sre.SRE_Match object; span=(0, 10), match='18-08-2020'>

Here, the regular expression engine checks for two consecutive digits. Upon finding the match, it moves to the hyphen character. After that, it checks the next two consecutive digits, then another hyphen, and finally four consecutive digits.

Let’s discuss three other regular expressions under repetition.

Repetition ranges

The repetition range is useful when you have to accept more than one format. Consider a scenario where both three-digit and four-digit numbers are accepted. Let’s have a look at the regular expression.

Python3

import re
  
  
print('Three Digit:', re.search(r'[\d]{3,4}', '189'))
print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))

Output

Three Digit: <_sre.SRE_Match object; span=(0, 3), match='189'>
Four Digit: <_sre.SRE_Match object; span=(0, 4), match='2145'>

Open-Ended Ranges

There are scenarios where there is no upper limit on character repetition. In such scenarios, you can leave the upper limit open by omitting it, as in {1,}. A common example is matching street addresses. Let’s have a look.



Python3

import re
  
  
print(re.search(r'[\d]{1,}','5th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))

Output

<_sre.SRE_Match object; span=(0, 1), match='5'>

Shorthand

Shorthand characters allow you to use the + character to specify one or more ({1,}) and the * character to specify zero or more ({0,}).

Python3

import re
  
print(re.search(r'[\d]+', '5th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))

Output

<_sre.SRE_Match object; span=(0, 1), match='5'>
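Since search stops at the first match, only the 5 is reported above. A short sketch of collecting every number with findall, and of how + and * differ when no digit is present; the word Floor is used as an illustrative input.

```python
import re

address = '5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305'

# findall returns every non-overlapping run of digits, not just the first
numbers = re.findall(r'\d+', address)
print(numbers)

# '+' needs at least one digit, so it fails on a digit-free string
plus_match = re.search(r'\d+', 'Floor')
# '*' accepts zero digits, so it succeeds with an empty match at index 0
star_match = re.search(r'\d*', 'Floor')
print(plus_match, repr(star_match.group()))
```

The empty match is the reason * is usually combined with other required parts of a pattern rather than used alone.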

Grouping

Grouping is the process of separating an expression into groups by using parentheses, and it allows you to fetch each individual matching group.  

Python3

import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020')
print(grp)

Output

<_sre.SRE_Match object; span=(0, 10), match='26-08-2020'>

Let’s see some of its functionality.

Return the entire match

The match object returned by the re module gives you the entire match through its group() method.

Python3

import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group())

Output

26-08-2020

Return a tuple of matched groups

You can use the groups() method to return a tuple that holds the individual matched groups.

Python3

import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.groups())

Output



('26', '08', '2020')

Retrieve a single group

By passing an index to the group method, you can retrieve just a single group. Group indices start at 1; index 0 refers to the entire match.

Python3

import re
  
  
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group(3))

Output

2020

Name your groups

The re module allows you to name your groups using the (?P<name>...) syntax. Let’s look at an example.

Python3

import re
  
  
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
                  '26-08-2020')
print(match.group('mm'))

Output

08

Individual match as a dictionary

We have seen how a match provides a tuple of individual groups. It can also provide the individual matches as a dictionary in which the name of each group acts as the dictionary key.

Python3

import re
  
  
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
                  '26-08-2020')
print(match.groupdict())

Output

{'dd': '26', 'mm': '08', 'yyyy': '2020'}
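The dictionary values are always strings. If numbers are needed, one possible follow-up step is a conversion with a dictionary comprehension; this is a sketch of one approach, not the only one.

```python
import re

match = re.search(r'(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})',
                  '26-08-2020')

# groupdict() yields strings; convert each captured field to an int
parts = {name: int(value) for name, value in match.groupdict().items()}
print(parts)
```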

Lookahead

A negated character class still has to match a character, so it fails when no character is left to check against it, for example at the end of a string. Lookahead overcomes this: it accepts or rejects a match based on the presence or absence of the content that follows, without consuming that content.

Python3

import re
  
  
print('negation:', re.search(r'n[^e]', 'Python'))
print('lookahead:', re.search(r'n(?!e)', 'Python'))

Output

negation: None
lookahead: <_sre.SRE_Match object; span=(5, 6), match='n'>

Lookahead can also disqualify a match that is not followed by a particular character. This is called a positive lookahead, and can be achieved by simply replacing the ! character with the = character.

Python3

import re
  
print('positive lookahead', re.search(r'n(?=e)', 'jasmine'))

Output



positive lookahead <_sre.SRE_Match object; span=(5, 6), match='n'>
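The span in the output shows that the lookahead checked the e without including it in the match. A short sketch comparing it with a plain pattern that consumes the same character:

```python
import re

# The plain pattern consumes both characters
consumed = re.search(r'ne', 'jasmine')
# The lookahead only asserts that 'e' follows; the match stays at 'n'
asserted = re.search(r'n(?=e)', 'jasmine')

print(consumed.group(), consumed.span())
print(asserted.group(), asserted.span())
```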

Substitution

The re.sub method replaces the parts of a string that match a regular expression and returns the resulting string. It is useful when you want to strip characters such as /, -, ., etc. before storing a value in a database. It takes three arguments:

  • the regular expression
  • the replacement string
  • the source string being searched

Let’s have a look at the code below, which removes the - characters from a credit card number.

Python3

import re
  
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4',
             '1111-2222-3333-4444'))

Output

1111222233334444
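Two variations on the same substitution, as a sketch: stripping every non-digit with the \D shortcut, and keeping only the last group while masking the rest. The masking format is an assumption chosen for illustration.

```python
import re

card = '1111-2222-3333-4444'

# \D matches any non-digit, so this removes the separators in one step
digits_only = re.sub(r'\D', '', card)
print(digits_only)

# Capture only the last block and refer to it as \1 in the replacement
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'****-****-****-\1', card)
print(masked)
```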

Compiled Regular Expressions

The re module can return a compiled regular expression object using the compile function. This object has its own search and sub methods, so a developer can reuse it whenever needed.

Python3

import re
  
  
regex = re.compile(r'([\d]{2})-([\d]{2})-([\d]{4})')
  
# search method
print('compiled reg expr', regex.search('26-08-2020'))
  
# sub method
print(regex.sub(r'\1.\2.\3', '26-08-2020'))

Output

compiled reg expr <_sre.SRE_Match object; span=(0, 10), match='26-08-2020'>
26.08.2020
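Reuse is most visible when the same pattern is applied to many strings. A minimal sketch, assuming a small list of log-like lines invented for illustration:

```python
import re

date_re = re.compile(r'(\d{2})-(\d{2})-(\d{4})')

lines = ['Joined on 26-08-2020', 'no date here', 'Left on 01-02-2021']

# The compiled object is created once and reused for every line
found = [m.group() for m in map(date_re.search, lines) if m]
print(found)
```

Lines without a date yield None from search and are skipped by the filter in the comprehension.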

Summary

Regular expressions are a powerful tool for data mining and data validation. However, avoid regular expressions when a more straightforward solution exists. Likewise, when you have to deal with complex structures such as a non-trivial document format, use a library built for that purpose.
