In this lesson of the Lexical Processing module, we will discuss the techniques generally used for basic lexical processing of text. Before jumping to the techniques, we should first understand the overall concept of lexical processing.
Lexical Processing:
Lexical processing starts by converting the raw text into words (tokens) and, depending on the requirement, into sentences or paragraphs as well.
- For example, if an email contains words such as prize, credit card, and money, then the email is represented by these words, and it is likely to be a phishing email.
- Hence, in general, the distribution of words present in a sentence gives us a good idea of what the sentence means. Several other processing steps are usually applied to make this distribution more representative of the sentence. For example, ‘prizes’ and ‘prize’ are treated as the same word: in general, we can convert plural words to their singular form, as both carry the same meaning in the sentence.
- For a simple application such as phishing email detection, lexical processing works fine, but it fails in more complex applications such as machine translation and text summarization. For example, the sentences “My second payment is due for this week” and “My payment is due for this week the second time” look similar based on their distribution of words, but they have very different meanings because of the order of the words. Lexical processing, however, will treat the two sentences as nearly equivalent, as the “distribution of words” in both is almost the same. Hence, we need a more advanced system to handle such cases.
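To make the “distribution of words” idea concrete, the sketch below counts the words of the two sentences with Python's collections.Counter. The tokenizer used here (lowercasing and splitting on whitespace) is a simplifying assumption for illustration, not a full lexical processing pipeline.

```python
from collections import Counter

first = "My second payment is due for this week"
second = "My payment is due for this week the second time"

def word_counts(sentence):
    """Lowercase the sentence, split it on whitespace, and count each word."""
    return Counter(sentence.lower().split())

first_counts = word_counts(first)
second_counts = word_counts(second)

# words the two sentences have in common, ignoring word order entirely
shared = sorted(set(first_counts) & set(second_counts))
print(shared)

# every word of the first sentence also appears in the second one,
# so a view based only on word counts sees the two as very similar
print(set(first_counts) <= set(second_counts))
```

The word order, which carries the difference in meaning here, is lost as soon as the sentences are reduced to counts.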
In this lesson, we will learn one of the most popular techniques in lexical processing: regular expressions, also referred to as regex.
Regular Expression
A regular expression is a sequence of characters, known as the pattern, that describes the substrings to find in a given string. It is a very efficient tool for extracting required information from text.
For example, suppose you have a customer registration page and you want to check whether the email address a user enters is valid.
Similarly, if you also collect mobile numbers on the registration page, you may want to check whether the user has entered a valid phone number. In both cases, a regular expression can describe the expected format.
Regular expressions are a very powerful tool in text processing. They help in cleaning and handling text in a much better way.
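As a rough illustration of the two validation scenarios above, the sketch below checks an email address and a mobile number with re.fullmatch(). Both patterns are deliberately simplified assumptions (real email addresses and phone number formats are far more varied), and they use constructs such as character classes and \d that go beyond the quantifiers covered in this lesson.

```python
import re

# simplified, hypothetical patterns for illustration only
EMAIL_PATTERN = r'[\w.+-]+@[\w-]+(\.[\w-]+)+'   # name@domain.tld
PHONE_PATTERN = r'\d{10}'                        # exactly ten digits

def is_valid_email(text):
    """Return True if the whole string looks like an email address."""
    return re.fullmatch(EMAIL_PATTERN, text) is not None

def is_valid_phone(text):
    """Return True if the whole string consists of exactly ten digits."""
    return re.fullmatch(PHONE_PATTERN, text) is not None

print(is_valid_email('user@example.com'))   # True
print(is_valid_email('user@example'))       # False
print(is_valid_phone('9876543210'))         # True
print(is_valid_phone('98765'))              # False
```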
To use regular expressions in Python, we use the re module. It is part of Python’s standard library, so you don’t have to install it explicitly; you just have to import it.
Let’s start using regular expressions in Python. In the exercise below, we will use the re.search() function. It expects two parameters, pattern and string, where pattern is the regex pattern we want to find and string is the input string in which to search for it.
result = re.search(pattern, string)
Exercise 1
Description
Consider the following sentence: “A regular expression is a sequence of characters that specifies a search pattern in the text.”
Write a regular expression pattern to check whether the word ‘sequence’ is present in the given string by using the re.search() function. The re.search() function returns a match object if the pattern is found in the string; otherwise it returns None.
# import regular expression library
import re
# input string on which to test regex pattern
string = 'A regular expression is a sequence of characters that specifies a search pattern in the text.'
# regex pattern to check if 'sequence' is present in an input string or not.
pattern = 'sequence'
# check whether pattern is present in string or not
result = re.search(pattern, string)
# evaluate result - print True if the pattern was found, else print False
if result is not None:
    print(True)
else:
    print(False)
True
Exercise 2
Description
Consider the same problem as above, with the sentence “A regular expression is a sequence of characters that specifies a search pattern in the text.” This time, find the starting position of the word ‘sequence’ using result.start().
# import regular expression library
import re
# input string on which to test regex pattern
string = 'A regular expression is a sequence of characters that specifies a search pattern in the text.'
# regex pattern to check if 'sequence' is present in an input string or not.
pattern = 'sequence'
# check whether pattern is present in string or not
result = re.search(pattern, string)
# store the start of the match
start_pos = result.start()
# print the result
print(start_pos)
26
Exercise 3
Description
Consider the same problem as described in Exercise 1, with the sentence “A regular expression is a sequence of characters that specifies a search pattern in the text.” This time, find the end position of the match using result.end().
# import regular expression library
import re
# input string on which to test regex pattern
string = 'A regular expression is a sequence of characters that specifies a search pattern in the text.'
# regex pattern to check if 'sequence' is present in an input string or not.
pattern = 'sequence'
# check whether pattern is present in string or not
result = re.search(pattern, string)
# store the end position of the match
end_pos = result.end()
# print the result
print(end_pos)
34
From the above exercises, we now understand how to import the regular expression library in Python and how to use it.
Further, we have learned to use the re.search() function and two methods of the match object it returns: start() and end(). start() returns the index at which the match begins, and end() returns the index just past the last matched character.
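The short sketch below puts the pieces from Exercises 1 to 3 together. It also uses span() and group(), two further methods of the match object that the exercises did not need, and shows that slicing the input string with the start and end positions recovers the matched word.

```python
import re

string = 'A regular expression is a sequence of characters that specifies a search pattern in the text.'
result = re.search('sequence', string)

start, end = result.start(), result.end()
print(start, end)          # 26 34
print(result.span())       # (26, 34)
print(result.group())      # sequence

# end() is one past the last matched character, so slicing recovers the match
print(string[start:end])   # sequence
```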
Next, we will learn about quantifiers in regular expressions. A quantifier specifies the number of times the preceding character must be present in the input string for a match to be found.
There are the following types of quantifiers in regular expressions:
Quantifier | Name | Meaning |
‘?’ | Question mark | Matches the preceding character zero or one time. It is generally used to mark the optional presence of a character. |
‘*’ | Asterisk | Matches the preceding character zero or more times. It is generally used to mark the repeatable occurrence of a character. |
‘+’ | Plus | Matches the preceding character one or more times. That means the preceding character has to be present at least once for the pattern to match the string. |
‘{m,n}’ | Curly braces | Matches the preceding character at least ‘m’ and at most ‘n’ times. |
‘{m,}’ | Curly braces | Matches the preceding character ‘m’ or more times, i.e., there is no upper limit. |
‘{,n}’ | Curly braces | Matches the preceding character from zero to ‘n’ times, i.e., only the upper limit is fixed. |
‘{n}’ | Curly braces | Matches the preceding character exactly ‘n’ times. |
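The following sketch tries each quantifier from the table on a small example. The helper matches() is a hypothetical name introduced here; it uses re.fullmatch() so that the whole string has to fit the pattern, which makes the effect of each quantifier easier to see.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# '?': zero or one 'u'
print(matches('colou?r', 'color'), matches('colou?r', 'colouur'))   # True False
# '*': zero or more 'b'
print(matches('ab*', 'a'), matches('ab*', 'abbb'))                  # True True
# '+': one or more 'b'
print(matches('ab+', 'a'), matches('ab+', 'abbb'))                  # False True
# '{m,n}': between two and three 'b'
print(matches('ab{2,3}', 'ab'), matches('ab{2,3}', 'abbb'))         # False True
# '{m,}': two or more 'b'
print(matches('ab{2,}', 'ab'), matches('ab{2,}', 'abbbbb'))         # False True
# '{,n}': at most two 'b'
print(matches('ab{,2}', 'a'), matches('ab{,2}', 'abbb'))            # True False
# '{n}': exactly two 'b'
print(matches('ab{2}', 'abb'), matches('ab{2}', 'abbb'))            # True False
```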
Now that we have seen the different quantifiers and their meanings, it’s time to practice them so that we can understand them better and use them fluently in practice.
The first quantifier that we will study is ‘?’.
Exercise 4: Use of Quantifier ‘?’
Description
Write a regular expression that matches the word ‘coin’ or ‘coins’ in a given piece of text.
Sample positive cases:
‘She pulled out her coin purse.’
‘She flipped both the coins and looked down.’
Sample negative cases:
‘It’s very humid outside’
‘you must not ask for more rupees’
# import regular expression library
import re
# input string on which to test regex pattern
sample_sent = ['She pulled out her coin purse.',
'She flipped both the coins and looked down.',
"It's very humid outside",
'you must not ask for more rupees']
pattern = 'coins?'
# iterate over the sample sentences and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # pass the pattern and the string to re.search()
    if result is not None:
        print(True)
    else:
        print(False)
True True False False
As we can see from the output, the pattern matches the first two sample sentences, which contain ‘coin’ or ‘coins’, and returns False for the other two, as they do not contain the pattern.
By this point, you should have some clarity about using the ‘?’ quantifier in a regular expression.
Exercise 5
Description
Write a regular expression that matches the following words:
- abc
- ab
- ac
- a
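Make sure that the regular expression doesn’t match the following words:
- abbc
- abcc
- abb
- acc
- bc
# import regular expression library
import re
# input strings on which to test the regex pattern
sample_sent = ['abc', 'ab', 'ac', 'a', 'abbc', 'abcc', 'abb', 'acc', 'bc']
# '^' and '$' anchor the pattern to the start and end of the string, so the
# whole word has to fit: 'a', then an optional 'b', then an optional 'c'
pattern = '^ab?c?$'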
# iterate over the sample words and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # pass the pattern and the string to re.search()
    if result is not None:
        print(True)
    else:
        print(False)
True True True True False False False False False
From the above code, we can see that the regex pattern successfully matches the first four sample words, whereas it returns False for the last five.
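One caveat is worth keeping in mind for exercises like this one: re.search() reports a match as soon as the pattern occurs anywhere in the string. The sketch below, offered as an illustration rather than as part of the exercise, shows that the unanchored pattern 'ab?c?' is found inside the lowercase word 'abbc', and that re.fullmatch() or the anchors ‘^’ and ‘$’ can be used when the whole string has to fit the pattern.

```python
import re

pattern = 'ab?c?'

# re.search() succeeds as soon as the pattern occurs anywhere in the string
print(re.search(pattern, 'abbc') is not None)      # True
print(re.search(pattern, 'abbc').group())          # ab

# re.fullmatch() succeeds only if the entire string fits the pattern
print(re.fullmatch(pattern, 'abbc') is not None)   # False
print(re.fullmatch(pattern, 'abc') is not None)    # True

# anchoring with ^ (start) and $ (end) has a similar effect with re.search()
print(re.search('^ab?c?$', 'abbc') is not None)    # False
print(re.search('^ab?c?$', 'abc') is not None)     # True
```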
The next quantifier that we will learn is ‘*’.
Exercise 6: Use of Quantifier ‘*’
Description
Match a binary number that starts with 010 and ends with zero or more ones.
Sample positive cases (pattern should match all of these):
0101
01011
010111
010
Sample negative cases (shouldn’t match any of these):
01
011
0
# import regular expression library
import re
# input string on which to test regex pattern
sample_sent = ['0101', '01011', '010111','010','01', '011','0']
pattern = '0101*'
# iterate over the sample strings and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # pass the pattern and the string to re.search()
    if result is not None:
        print(True)
    else:
        print(False)
True True True True False False False
So far, we have covered two quantifiers, ‘?’ and ‘*’, and solved some exercises to make the concepts clear.
Next, we will cover the third quantifier, ‘+’. Before that, we should clarify a point that people often confuse: the difference between ‘*’ and ‘+’. The ‘+’ quantifier requires the preceding character to be present at least once, while ‘*’ does not: with ‘*’, the preceding character may be present zero or more times.
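The difference can be seen side by side in the small sketch below, which tests the illustrative patterns 'ab*c' and 'ab+c' (chosen here only as examples) against the same strings using re.fullmatch().

```python
import re

samples = ['ac', 'abc', 'abbbc']
results = {}

for text in samples:
    star = re.fullmatch('ab*c', text) is not None   # 'b' may be missing
    plus = re.fullmatch('ab+c', text) is not None   # 'b' must occur at least once
    results[text] = (star, plus)
    print(text, star, plus)
# ac True False
# abc True True
# abbbc True True
```

Only 'ac', which has no 'b' at all, separates the two quantifiers.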
Exercise 7: Use of Quantifier ‘+’
Description
Write a pattern that matches numbers that are a power of 10.
Sample positive matches (should match all of the following):
- 10
- 100
- 1000
Sample negative matches (shouldn’t match any of these):
- 0
- 1
- 15
# import regular expression library
import re
# input string on which to test regex pattern
sample_sent = ['10', '100', '1000','0','1', '15']
pattern = '10+'
# iterate over the sample strings and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # pass the pattern and the string to re.search()
    if result is not None:
        print(True)
    else:
        print(False)
True True True False False False
As we can see from the output, the pattern ‘10+’ matches all the sample strings that are powers of 10 and none of the others.
To summarise, until now we have studied the following quantifiers:
- ‘?’: Optional preceding character
- ‘*’: Match preceding character zero or more times
- ‘+’: Match preceding character one or more times (i.e. at least once)
But what if we want to look for a character that appears exactly 3 times, or between 2 and 4 times? We cannot express that using the quantifiers we have studied so far.
Hence, the next quantifier that we will learn lets us specify how many times the preceding character must occur.
There are four variants of the quantifier {m,n}. Write them without spaces inside the braces; with a space, Python’s re module treats the braces as literal characters.
- {m,n}: Matches the preceding character at least ‘m’ and at most ‘n’ times.
- {m,}: Matches the preceding character ‘m’ or more times.
- {,n}: Matches the preceding character from zero to ‘n’ times.
- {n}: Matches the preceding character exactly ‘n’ times.
Note that this quantifier can replace the ‘?’, ‘*’, and ‘+’ quantifiers in the following ways:
- ‘?’ is equivalent to zero or one time, or {0,1}
- ‘*’ is equivalent to zero or more times, or {0,}
- ‘+’ is equivalent to one or more times, or {1,}
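The equivalences listed above can be checked with a short sketch. The patterns and sample strings are illustrative choices, and re.fullmatch() is used so that the whole string has to fit. The last lines also show, for Python's re module, what happens when a space is placed inside the braces.

```python
import re

samples = ['ac', 'abc', 'abbc']

# each pair holds a shorthand quantifier and its curly-brace form
pairs = [('ab?c', 'ab{0,1}c'),
         ('ab*c', 'ab{0,}c'),
         ('ab+c', 'ab{1,}c')]

all_agree = True
for short, braces in pairs:
    for text in samples:
        short_hit = re.fullmatch(short, text) is not None
        brace_hit = re.fullmatch(braces, text) is not None
        print(short, braces, text, short_hit, brace_hit)
        all_agree = all_agree and (short_hit == brace_hit)

print(all_agree)   # True

# with a space inside the braces, re no longer sees a quantifier
spaced_hit = re.fullmatch('ab{1, 2}c', 'abc') is not None
print(spaced_hit)  # False
```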
Exercise 8: Use of Quantifier ‘{m,n}’
Description
Write a regular expression to match the word ‘allergy’. But match only those variants of the word where there are a minimum of two ‘l’s and a maximum of five ‘l’s.
# import regular expression library
import re
# input string on which to test regex pattern
sample_sent = ['allergy', 'alllergy', 'aeray','alay']
pattern = 'al{2,5}ergy'
# iterate over the sample words and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # pass the pattern and the string to re.search()
    if result is not None:
        print(True)
    else:
        print(False)
True True False False
Exercise 9: Use of Quantifier ‘{m,}’
Description
Write a regular expression that matches variants of the word ‘income’ where there are more than two ‘e’s at the end of the word.
The following strings should match:
Incomeee
incomeeee
The following strings shouldn’t match:
Incom
Income
# import regular expression library
import re
# input string on which to test regex pattern
sample_sent = ['Incomeee', 'incomeeee', 'Incom','Income']
# 'e{3,}' requires three or more 'e's, i.e. more than two
pattern = 'I?i?ncome{3,}'
# iterate over the sample words and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # pass the pattern and the string to re.search()
    if result is not None:
        print(True)
    else:
        print(False)
True True False False
In this lesson, we have covered four main types of quantifiers and practiced them through exercises. In the next lesson, we will continue with regular expressions and learn to handle whitespace, special characters, grouping, the pipe operator, regex flags, and the compile function.
Proceed to Lesson 3 – Regular Expression: Quantifier Part 2
Now, we are giving you practice questions to test your understanding of the quantifiers so far.
Practice Exercise: Quantifier
- Write a regular expression to check whether a given input URL contains ‘http’ or ‘https’.
- Write a regular expression to check whether a word begins with the letter ‘c’ followed by zero or one instance of the letter ‘a’.
- Write a pattern that matches a string that starts with 1 and ends with 0, with an arbitrary number of 1s (zero or more) in between.
- Write a regular expression that matches a string where ‘a’ is followed by ‘b’ a maximum of three times.
- Write a regular expression to match a term that has three or more ‘0’s followed by one or more ‘1’s.
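To check your own answers to the practice questions, a small test helper along the following lines may be useful. The function check_pattern() is a hypothetical helper written for this purpose, not part of the re module, and it is demonstrated with the ‘coin’ / ‘coins’ exercise from this lesson rather than with a practice question.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return True if the pattern is found in every positive case and in no negative case."""
    positives_ok = all(re.search(pattern, text) for text in should_match)
    negatives_ok = not any(re.search(pattern, text) for text in should_not_match)
    return positives_ok and negatives_ok

# example using the 'coin' / 'coins' exercise from this lesson
positive_cases = ['She pulled out her coin purse.',
                  'She flipped both the coins and looked down.']
negative_cases = ["It's very humid outside",
                  'you must not ask for more rupees']

print(check_pattern('coins?', positive_cases, negative_cases))   # True
```

Replace the pattern and the two lists with your own answer and test cases for each practice question.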