In this lesson, we will learn how to use grouping in regular expressions. Grouping is used when we want to extract sub-patterns from a larger match.
Let’s consider a situation in which we have a list of the dates of birth of all the employees of a company and we only want the month and year from each date. We can achieve this with a regular expression that matches the whole date and uses groups to extract individual components such as the day, month, and year.
To use grouping in a regular expression, we place the part of the pattern we want to capture inside parentheses.
For example, let’s assume we have a source string: “Ramesh’s date of birth is 18/02/1980”.
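Now, if we want to extract the month and year from the above string, we can write a regex pattern like the one below. The r prefix makes it a raw string, so Python passes the backslashes through to the regex engine unchanged.
pattern = r"\d{1,2}/(\d{1,2})/(\d{4})"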
Here, we have placed parentheses around the month and year parts of the pattern because those are the parts we want to extract from the date. Let’s see the code below:
import re
string = "Ramesh's date of birth is 18/02/1980"
pattern = r"\d{1,2}/(\d{1,2})/(\d{4})"
# store result
result = re.search(pattern, string)  # pass the parameters to the re.search() function
# evaluate result - don't change the following piece of code, it is used to evaluate your regex
if result is not None:
    print(result.group(0))  # result.group(0) will output the entire match
else:
    print(False)
18/02/1980
result.group(0)
'18/02/1980'
result.group(1)
'02'
result.group(2)
'1980'
As we can see from the above output, group(0) holds the entire match for the regex pattern, whereas group(1) and group(2) hold the month and the year respectively. Groups are numbered from 1, in the order in which their opening parentheses appear in the pattern.
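To connect this back to the employee scenario described earlier, here is a minimal sketch that applies the same pattern to several strings and collects the month and year from each. The sample records are made up for illustration, and using groups() to fetch all captured groups at once is just one reasonable way to do it.

```python
import re

# Hypothetical sample data; the names and dates are made up for illustration.
records = [
    "Ramesh's date of birth is 18/02/1980",
    "Sunita's date of birth is 5/11/1992",
    "Date of birth not recorded",
]

# Compile once because the same pattern is reused for every record.
pattern = re.compile(r"\d{1,2}/(\d{1,2})/(\d{4})")

month_year = []
for record in records:
    match = pattern.search(record)
    if match is not None:
        # groups() returns all captured groups as a tuple: (month, year)
        month_year.append(match.groups())

print(month_year)
# [('02', '1980'), ('11', '1992')]
```

The third record contains no date, so search() returns None for it and it is skipped.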
Let’s practice one more exercise to understand grouping better.
Exercise 1
Description
Write a regular expression to extract the domain name from an email address present in a raw string.
Sample input:
wisdomml2020@gmail.com
Expected output:
gmail.com
import re
string = 'Our company email address is wisdomml2020@gmail.com'
# regex pattern
pattern = r"\w+@([A-Za-z]+\.com)"
# store result
result = re.search(pattern, string)
result.group(0)
'wisdomml2020@gmail.com'
result.group(1)
'gmail.com'
result.group(2)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-8-ced712dbcbc6> in <module>
----> 1 result.group(2)

IndexError: no such group
From the above output, we can see the power of grouping. Within the regex pattern that identifies the email address, we have placed the domain part inside parentheses so that it is captured as a group.
Here, group(0) holds the full email address and group(1) holds the domain name, i.e., gmail.com, while group(2) raises an IndexError with the message “no such group” because the pattern defines only one group.
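One way to avoid this IndexError is to check how many groups the pattern defines before asking for one. The sketch below is only an illustration: safe_group is a hypothetical helper written for this lesson, not part of the re module, and the pattern is written as a raw string with an explicit [A-Za-z] letter range.

```python
import re

string = 'Our company email address is wisdomml2020@gmail.com'
pattern = re.compile(r"\w+@([A-Za-z]+\.com)")

result = pattern.search(string)

# The compiled pattern's .groups attribute is the number of capturing groups.
print(pattern.groups)     # 1

# groups() returns every captured group as a tuple.
print(result.groups())    # ('gmail.com',)


def safe_group(match, index):
    """Hypothetical helper: return group `index`, or None if it does not exist."""
    if match is None or index > match.re.groups:
        return None
    return match.group(index)


print(safe_group(result, 1))  # gmail.com
print(safe_group(result, 2))  # None
```

Returning None keeps the calling code simple, but raising a clearer error would be an equally valid design choice.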
We have now covered most of the important concepts in regular expressions.
The next step is to apply the concepts we have learnt so far to real-world use cases.
In the next section, we will learn about tokenization in natural language processing.