A spell corrector is a natural language processing (NLP) tool that automatically corrects spelling errors in text. It is a popular application of NLP and is useful in a variety of contexts, including social media, online search, and email.

Spell correction is a critical part of lexical processing. In most applications, it is one of the first steps performed in the preprocessing layer.
For example, if we are developing a chatbot to book airline flights and a user requests ‘Book a flight from Mumbai to Kolkatt’, the chatbot should gracefully handle the spelling error and still return relevant results.
In this project, we are going to implement Norvig’s spell corrector, a method that usually performs very well in terms of accuracy.
Spell Corrector Implementation in Python
import re
from collections import Counter

def words(document):
    "Convert text to lower case and tokenize the document."
    return re.findall(r'\w+', document.lower())

# create a frequency table of all the words of the document
all_words = Counter(words(open('big.txt').read()))

def edits_one(word):
    "Create all edits that are one edit away from `word`."
    alphabets = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in alphabets]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabets]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    return set(deletes + inserts + replaces + transposes)

def edits_two(word):
    "Create all edits that are two edits away from `word`."
    return (e2 for e1 in edits_one(word) for e2 in edits_one(e1))

def known(words):
    "The subset of `words` that appear in `all_words`."
    return set(word for word in words if word in all_words)

def possible_corrections(word):
    "Generate possible spelling corrections for `word`."
    return (known([word]) or known(edits_one(word)) or known(edits_two(word)) or [word])

def prob(word, N=sum(all_words.values())):
    "Probability of `word`: number of appearances of `word` / total number of tokens."
    return all_words[word] / N

def rectify(word):
    "Return the most probable spelling correction for `word` out of all the possible_corrections."
    return max(possible_corrections(word), key=prob)
from spell_corrector import rectify

correct = rectify("fdind")
print(correct)   # 'find'
Now let’s understand the working of each of the functions in detail.
The function words() tokenizes the document that is passed into it, after converting it to lower case. Next, the Counter class creates a frequency distribution of the words present in the seed document named ‘big.txt’: each word is stored along with its count. This seed document ‘big.txt’ contains the book ‘The Adventures of Sherlock Holmes’, available in plain-text format on Project Gutenberg’s website.
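As a quick illustration of these two steps (the exact counts depend on the copy of big.txt used, so treat the numbers as indicative only):

print(words('Book a flight from Mumbai to Kolkata.'))
# ['book', 'a', 'flight', 'from', 'mumbai', 'to', 'kolkata']

print(all_words['the'])       # a very large count — 'the' is by far the most frequent token
print(all_words['kolkatt'])   # 0 — a misspelling never occurs in the seed document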
Here, this seed document acts like a lookup dictionary for the spell corrector as it contains the correct spellings of almost every word.
The edits_one() function creates all the possible words that are one edit distance away from the input word. But it also contains words that are not valid English words. For example, if we pass the word ‘laern’ (a misspelling of the word ‘learn’) to edits_one(), it will create a set in which the word ‘lgern’ will also be present, since it is one edit away from ‘laern’. However, we know that it is not a valid English word.
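We can verify this behaviour directly in a few lines:

candidates = edits_one('laern')
print(len(candidates))           # a few hundred strings for a five-letter word
print('learn' in candidates)     # True — transposing 'a' and 'e' gives the intended word
print('lgern' in candidates)     # True — also one edit away, but not a valid English word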
Similarly, the edits_two() function creates all the possible strings that are two edits away from the input word. Again, many of these will be garbage words.
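Note that edits_two() returns a generator, so we materialise it into a set to inspect it (the count below is approximate and grows quickly with word length):

two_away = set(edits_two('laern'))
print(len(two_away))        # tens of thousands of strings, the vast majority garbage
print('lean' in two_away)   # True — transpose 'a' and 'e', then delete the 'r'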
Next, the known() function filters out the non-dictionary English words (garbage words) from a given list of words. It does this by looking each word up in the frequency distribution built from the seed document: any output of edits_one() or edits_two() that is not present there is simply discarded.
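For example (the second result depends on which words happen to occur in big.txt):

print(known(['learn', 'lgern', 'laern']))   # {'learn'} — only the dictionary word survives
print(known(edits_one('laern')))            # likely just {'learn'}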
The function possible_corrections() returns all the plausible alternative spellings of a word. For example, if the user types the misspelled word ‘fina’, the output of this function would be ‘final’, ‘find’, ‘nina’, and ‘fine’.
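We can reproduce that example directly (the exact candidate set depends on which of these words actually occur in big.txt):

print(possible_corrections('fina'))
# e.g. {'final', 'find', 'nina', 'fine'} — the known words within one edit of 'fina'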
Now, let’s understand the stepwise working of the function possible_corrections(), illustrated with a short example after the list:
- First, it checks whether the word is present in the dictionary. If it is, the word itself is returned as the only candidate, since it is already a correctly spelled dictionary word.
- If the word is not present in the dictionary, it creates a list of all the known words that are one edit distance away. If edits_one() yields no valid English words, it then fetches all valid English words that are two edits away from the input word using edits_two().
- If there are no valid words even two edits away, the function returns the original input word. This means that there are no alternative words that the spell corrector could suggest.
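Here is a minimal sketch of these three cases (the outputs for the invented misspellings are what we would expect with big.txt as the seed document):

print(possible_corrections('learn'))    # already a dictionary word -> {'learn'}
print(possible_corrections('laern'))    # valid words one edit away -> likely {'learn'}
print(possible_corrections('laernn'))   # needs two edits           -> likely {'learn'}
print(possible_corrections('xzqwv'))    # nothing within two edits  -> ['xzqwv']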
Finally, there is the prob() function. It returns the probability of a word, computed as its count in the seed document divided by the total number of tokens; rectify() then picks the candidate with the highest probability. This is the main reason we use a seed document instead of a plain dictionary: a dictionary only lists the correct English words, while a seed document also provides the frequency of each one. We use this frequency to choose a single suggestion when there is more than one possible correct word for a given misspelling.
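The frequency-based tie-breaking can be seen directly (again, exact probabilities depend on big.txt):

for candidate in possible_corrections('fina'):
    print(candidate, prob(candidate))   # relative frequency of each candidate in big.txt

print(rectify('fina'))   # the most frequent candidate wins — e.g. 'find'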
We can integrate this spell corrector code into various applications to correct spellings in our day-to-day work.
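As a minimal sketch of such an integration (correct_sentence() is a hypothetical helper, not part of the original code), we can rectify a sentence word by word:

from spell_corrector import rectify   # assumes the code above is saved as spell_corrector.py

def correct_sentence(sentence):
    "Hypothetical helper: rectify each whitespace-separated word in a sentence."
    return ' '.join(rectify(word) for word in sentence.lower().split())

print(correct_sentence('we fdind teh answer'))   # likely: 'we find the answer'

Keep in mind that suggestions are limited to words that occur in the seed document, so domain-specific vocabulary such as city names would require a richer corpus than big.txt.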