In this lesson of the Lexical Processing module, we will discuss the techniques generally employed in the basic lexical processing of text. Before jumping to the techniques, we should first understand the overall concept of lexical processing.
First, we will generally convert the raw text into words or tokens and, depending on our requirements, convert it into sentences or paragraphs as well.
- For example, if an email contains words such as prize, credit card, and money, then the email is represented by these words, and it is likely to be a phishing email.
- Hence, in general, the distribution of words present in a sentence gives us a clear idea of what the sentence means. Apart from that, several other processing steps are usually undertaken to make this distribution of words more representative of the sentence. For example, ‘prizes’ and ‘prize’ are considered to be the same word. In general, we can convert all plural words to their singular form, as they carry the same meaning in the sentence.
- For a simple application such as phishing email detection, lexical processing will work fine, but it will fail in more complex applications such as machine translation and text summarization. For example, the sentences “My second payment is due for this week” and “My payment is due for this week the second time” look similar based on their word distribution, but they have very different meanings because of the order of the words. Lexical processing, however, will treat the two sentences as nearly the same, as the “distribution of words” in both sentences is almost identical. Hence, we need a more advanced system for handling these types of cases.
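To make the idea of a “distribution of words” concrete, here is a minimal sketch that counts the words in the two example sentences and compares the counts. The helper name word_counts and the simple lowercase-and-split tokenization are assumptions made for illustration; a real pipeline would usually tokenize more carefully.

```python
from collections import Counter

def word_counts(sentence):
    """Lowercase the sentence, split it on whitespace, and count each word."""
    return Counter(sentence.lower().split())

first = word_counts('My second payment is due for this week')
second = word_counts('My payment is due for this week the second time')

# Words that appear more often in one sentence than in the other
only_in_first = first - second
only_in_second = second - first

print(only_in_first)
print(only_in_second)
```

Every word of the first sentence also appears in the second, and the second has only ‘the’ and ‘time’ in addition. The counts carry no information about word order, which is why an approach based only on word distribution cannot tell the two meanings apart.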
In this lesson, we will learn one of the most popular techniques in lexical processing: regular expressions, also referred to as regex.
A regular expression is a very efficient tool for extracting required information from text. It is a sequence of characters, also known as a pattern, that is used to find and extract matching substrings from a given string.
For example, suppose you have a customer registration page and you want to validate whether the user has entered a valid email address.
Similarly, if you are also collecting mobile numbers on the customer registration page, you may want to validate whether the user has entered a valid phone number. Both checks can be expressed as regular expression patterns.
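As a rough illustration of these two validation cases, the sketch below checks an email address and a mobile number with Python's re.fullmatch() function. The patterns, the constant names, and the helper functions are hypothetical and deliberately simple: the email pattern only checks for a name@domain.tld shape, and the phone pattern assumes a plain 10-digit number with no country code or separators.

```python
import re

# Hypothetical, simplified patterns for illustration only
EMAIL_PATTERN = r'[\w.+-]+@[\w-]+(\.[\w-]+)+'
PHONE_PATTERN = r'\d{10}'  # assumes exactly 10 digits, no spaces or prefix

def is_valid_email(text):
    """Return True if the whole string has a name@domain.tld shape."""
    return re.fullmatch(EMAIL_PATTERN, text) is not None

def is_valid_phone(text):
    """Return True if the whole string is exactly 10 digits."""
    return re.fullmatch(PHONE_PATTERN, text) is not None

print(is_valid_email('user@example.com'))
print(is_valid_email('user@example'))
print(is_valid_phone('9876543210'))
print(is_valid_phone('98765'))
```

re.fullmatch() is used here instead of re.search() because validation should accept the input only when the entire string fits the pattern, not when a valid-looking fragment appears somewhere inside it.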
Regular expressions are a very powerful tool in text processing. They make cleaning and handling text much easier.
To use regular expressions in Python, we use the re module. It is part of the Python standard library, so you don’t have to install it separately; you just have to import it.
Let’s start using regular expressions in Python. In the exercise below, we will use the ‘re.search()’ function of the re module. This function expects two parameters, pattern and string, where pattern is the regex pattern that describes what we are looking for and string is the input string in which we search for the pattern.
result = re.search(pattern, string)
Consider the following sentence: “A regular expression is a sequence of characters that specifies a search pattern in the text.”
Write a regular expression pattern to check whether the word ‘sequence’ is present in the given string by using the re.search() function. The ‘re.search()’ function returns a match object if the pattern is found in the string; otherwise, it returns None.
# import regular expression library
import re
# input string on which to test regex pattern
string = 'A regular expression is a sequence of characters that specifies a search pattern in the text.'
# regex pattern to check whether 'sequence' is present in an input string or not
pattern = 'sequence'
# check whether pattern is present in string or not
result = re.search(pattern, string)
# evaluate result - print True if the pattern is found, else print False
if result is not None:
    print(True)
else:
    print(False)
Consider the same problem as above: find the word ‘sequence’ in the sentence “A regular expression is a sequence of characters that specifies a search pattern in the text.” This time, extract the starting position of the word ‘sequence’ using result.start().
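One possible way to approach this exercise is sketched below. It reuses the string and pattern from the previous example and reads the position from the match object. The variable names start and end are chosen only for illustration.

```python
import re

string = 'A regular expression is a sequence of characters that specifies a search pattern in the text.'
pattern = 'sequence'

result = re.search(pattern, string)

# re.search() returns None when nothing matches, so check before using the result
if result is not None:
    start = result.start()  # index of the first character of the match
    end = result.end()      # index just past the last character of the match
    print(result.group(), start, end)
else:
    print('pattern not found')
```

Positions are zero-based, so the word ‘sequence’ starts at index 26 in this sentence, and slicing string[start:end] gives back the matched word.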