In this lesson, we will learn five of the most important and commonly used functions in Python's ‘re’ module: re.search(), re.match(), re.sub(), re.finditer() and re.findall(). We will also discuss word boundaries in regular expressions.
So far we have explored only one function in the ‘re’ module, re.search(). It is not the only function the module offers, so we will now look at some other commonly used ones.
1. Match Function – re.match():
The match function returns a non-empty match only if the pattern is present at the very beginning of the string; otherwise it returns None. The search function we have studied so far behaves differently: it scans the string from left to right, keeps searching until it finds the pattern, and then returns the first match, wherever it occurs.
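The difference is easiest to see by running both functions on the same input. The snippet below is a minimal sketch; the sample string `text` is a made-up example, not part of the lesson's data.

```python
import re

# Hypothetical sample string: the digits are not at the beginning.
text = "order 66 shipped"

# re.match() only tries the pattern at index 0, so it finds nothing here.
matched = re.match(r"\d+", text)
print(matched)  # None

# re.search() scans forward and returns the first run of digits it meets.
found = re.search(r"\d+", text)
print(found.group(), found.start())  # 66 6
```

Both functions return a match object on success and None on failure, so the result is usually checked before calling group().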
Let’s understand this function by working through a problem with re.match().
Exercise 1: re.match()
Description
Write a string such that when you run the re.match() function on the string using the given regex pattern ‘\d+’, the function returns a non-empty match.
import re
pattern = r'\d+'
# write a string such that the re.match() function returns a non-empty match while using the pattern '\d+'
test_strings = ['dummy_user_5001','aksh123','1000_lakhs','60']
for sent in test_strings:
    # check whether the pattern is present at the start of the string
    result = re.match(pattern, sent)  # pass the arguments to the re.match function
    if result is not None:
        print(sent + '->' + ' ' + str(True))
    else:
        print(sent + '->' + ' ' + str(False))
dummy_user_5001-> False
aksh123-> False
1000_lakhs-> True
60-> True
As the output shows, the pattern matches one or more digits, but only the strings that start with digits produce a match, because re.match() returns a non-empty match only if the match is present at the very beginning of the string. That is why 1000_lakhs and 60 return True whereas the others return False.
The next function that we are going to learn is the substitute function.
2. Substitute Function – re.sub()
The substitute function in the regular expression is used to substitute a substring with another substring of our choice.
For example, we may want to replace the American spelling ‘neighbor’ with the British spelling ‘neighbour’.
We also commonly use the re.sub() function for text cleaning tasks. For instance, it can replace every special character in a given string with a common placeholder, such as SP_CHAR, to represent all the special characters in the text.
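Both uses can be sketched in a few lines. The sample strings below are made up for illustration, and the character class `[^\w\s]` is only one possible definition of "special character" (anything that is neither a word character nor whitespace); a real cleaning task may need a different one.

```python
import re

# Hypothetical sample sentences.
spelling_text = "My neighbor is kind"
noisy_text = "Wow! 50% off"

# Replace the American spelling with the British one (whole word only).
british = re.sub(r"\bneighbor\b", "neighbour", spelling_text)
print(british)  # My neighbour is kind

# Replace each special character with the placeholder SP_CHAR.
# Assumption: "special" means not a word character and not whitespace.
cleaned = re.sub(r"[^\w\s]", "SP_CHAR", noisy_text)
print(cleaned)  # WowSP_CHAR 50SP_CHAR off
```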
More generally, re.sub() replaces every part of the input string that matches a regex pattern: the regex engine finds the substrings that fit the pattern, and re.sub() swaps each one for the replacement string.
For example, suppose we want to replace all the digits in a communication address with a certain string, say ‘XXX’. We can do so using the code below.
import re
# pattern for finding all numeric digits
pattern = r"\d"
# string which we want to replace with
to_replace = "XXX"
# input string in which we want to substitute
input_str = "My address is 35, Napier Road Colony Part 1, Uttar Pradesh - 226003"
# substitute re function
re.sub(pattern, to_replace, input_str)
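One detail worth noticing is that the pattern \d matches a single digit, so each digit is replaced separately. If the goal is to replace each whole number once, the quantifier + can be added. The following is a small sketch on a shortened, made-up address:

```python
import re

# Hypothetical shortened address.
address = "Flat 35, Block 1"

# \d matches one digit at a time: "35" becomes "XXX" twice.
per_digit = re.sub(r"\d", "XXX", address)
print(per_digit)  # Flat XXXXXX, Block XXX

# \d+ matches a whole run of digits: "35" becomes a single "XXX".
per_number = re.sub(r"\d+", "XXX", address)
print(per_number)  # Flat XXX, Block XXX
```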
Exercise 2: re.sub()
Description
You are given the following string:
“You can reach us at 08584986756 or 03361562153”
Substitute all the 11-digit phone numbers present in the above string with “#”.
import re
string ="You can reach us at 08584986756 or 03361562153"
# regex pattern
pattern = r"\d{11}"
# replacement string
replacement = "#"
# substitute every match of the pattern with the replacement string
result = re.sub(pattern, replacement, string)
result
'You can reach us at # or #'
As the output shows, we were able to substitute the 11-digit phone numbers with “#” using the regex pattern and the re.sub() function. The one requirement is that we specify the correct regex pattern for the substitution.
So it is always recommended to test your regex pattern before applying it to your dataset. There are many online regex checking tools where you can try out a pattern, and once it works you can use it in your code for data manipulation. One of the best online regex checker tools is pythex.
Next, we will move on to two other interesting regex functions: findall() and finditer().
Suppose we have a huge corpus of customer data from which we have to extract only the email ids of specific domains (gmail.com, outlook.com). In that case we can use the finditer() or findall() functions. The difference between the two is that findall() returns a list containing all the matches, whereas finditer() returns an iterator of match objects, which is typically used in a ‘for’ loop to go through the matches one by one.
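The email scenario can be sketched as follows. The corpus string and the email pattern are illustrative assumptions: the pattern is deliberately simple and would not cover every valid email address.

```python
import re

# Hypothetical one-line "corpus".
corpus = "Contact asha@gmail.com, ravi@yahoo.com or lee@outlook.com"

# Simplified pattern (an assumption): word characters or dots, then @,
# then one of the two wanted domains. (?:...) groups without capturing,
# so findall() returns the full matches.
pattern = r"\b[\w.]+@(?:gmail|outlook)\.com\b"

# findall() hands back all matches at once as a list of strings.
emails = re.findall(pattern, corpus)
print(emails)  # ['asha@gmail.com', 'lee@outlook.com']

# finditer() yields match objects one by one, which also carry positions.
spans = []
for m in re.finditer(pattern, corpus):
    spans.append((m.group(), m.span()))
print(spans[0])  # ('asha@gmail.com', (8, 22))
```

findall() is convenient when only the matched text is needed; finditer() is the better fit when the position of each match matters or when the input is large and the matches should be processed lazily.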
To understand them better we have to do some exercises.
Exercise 3: re.finditer()
Description
Write a regular expression to extract all the words from a given sentence. Then use the re.finditer() function and store all the matched words that are of length more than or equal to 7 letters in a list called output.
Sample input:
“American English spellings are based mostly on how the word sounds when it is spoken.”
Expected output:
3
import re
# given string
string = '''American English spellings are based mostly on
how the word sounds when it is spoken.
'''
# regex pattern
pattern = r'\w+'
# store results in the list 'output'
output = []
# iterate over the matches
for match in re.finditer(pattern, string):
    # keep only the words with 7 or more letters
    if len(match.group()) >= 7:
        output.append(match.group())
# print the number of matched words
print(len(output))
3
In the above code, we iterated over every word matched in the string and, whenever a word was at least 7 letters long, placed it in a list whose length we printed at the end.
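Because finditer() yields match objects, the same filtering can also be written as a list comprehension, and each match can report where it starts. This is an alternative sketch of the exercise, not the only way to solve it; the variable names are illustrative.

```python
import re

sentence = ("American English spellings are based mostly on "
            "how the word sounds when it is spoken.")

# Keep the text of every word with 7 or more letters.
long_words = [m.group() for m in re.finditer(r"\w+", sentence)
              if len(m.group()) >= 7]
print(long_words)       # ['American', 'English', 'spellings']
print(len(long_words))  # 3

# Match objects also expose the starting index of each match.
starts = [m.start() for m in re.finditer(r"\w+", sentence)
          if len(m.group()) >= 7]
print(starts)           # [0, 9, 17]
```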
Exercise 4: re.findall()
Description
Write a regular expression to extract all the words that have the suffix ‘ing’ using the re.findall() function. Store the matches in the variable ‘output’ and print its length.
Sample input:
“Ramesh was singing a song while everybody was laughing!!”
Expected output:
2
import re
#given string
string='''Ramesh was singing a song while everybody was laughing!!
'''
# regex pattern
pattern = r'\b(\w+ing)\b'
# store results in the list 'output'
output = re.findall(pattern, string)
print(output)
['singing', 'laughing']
print(len(output))
2
In the above code, we have used two interesting things: the first is the ‘r’ before the pattern, and the second is the ‘\b’ around the pattern, neither of which we have studied yet.
The ‘r’ at the start of the pattern means that the string is to be treated as a raw string: Python does not process backslash escape sequences such as ‘\b’ or ‘\n’ in it, so the backslashes reach the regex engine unchanged. Without the ‘r’, Python would turn ‘\b’ into a backspace character before the regex engine ever saw it. If you want to know more kindly refer to this python document.
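A quick way to see what the ‘r’ prefix changes is to compare the same pattern written with and without it. The test sentence below is just an illustrative example.

```python
import re

plain = "\bcap\b"   # Python converts each \b into a backspace character
raw = r"\bcap\b"    # backslashes are kept, so the regex engine sees \b

# The plain string is shorter because each \b collapsed to one character.
print(len(plain), len(raw))  # 5 7

# The plain pattern looks for literal backspace characters, so it fails.
plain_result = re.search(plain, "a blue cap")
print(plain_result)  # None

# The raw pattern uses \b as a word boundary and finds the word.
raw_result = re.search(raw, "a blue cap")
print(raw_result.group())  # cap
```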
The ‘\b’ around the pattern denotes a word boundary, i.e., it matches the position between a word character and a non-word character, including the start or end of the string when a word character sits there.
For example, the regex \bcap\b will match cap in a blue cap, but it will not match in words such as capsized, incapable, or capacity. In other words, whatever is wrapped inside \b must be a whole word. Interestingly, if we remove one of the boundaries, \bcap will also match the cap in capsized and capacity, and cap\b will match the cap in icecap, midcap, etc.
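These cases can be checked directly. The sketch below runs the three patterns over the words mentioned above; the helper function `words_matching` is a hypothetical name introduced only for this illustration.

```python
import re

words = ["cap", "capsized", "incapable", "capacity", "icecap", "midcap"]

def words_matching(pattern):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

# Boundaries on both sides: only the standalone word qualifies.
both = words_matching(r"\bcap\b")
print(both)   # ['cap']

# Boundary on the left only: words that start with "cap".
left = words_matching(r"\bcap")
print(left)   # ['cap', 'capsized', 'capacity']

# Boundary on the right only: words that end with "cap".
right = words_matching(r"cap\b")
print(right)  # ['cap', 'icecap', 'midcap']
```

Note that incapable matches none of the three patterns, because its cap has word characters on both sides.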
In this lesson, we have studied the five most commonly used regex functions. In the next lesson, we will learn about grouping in regular expressions.