Canonicalization in NLP is the process of reducing a word to its base or root form. Stemming and lemmatization are two common canonicalization techniques: stemming truncates a word to a crude stem by stripping suffixes, whereas lemmatization maps a word to its dictionary lemma. In both cases, the result is the root of the inflected word.
However, there are real-world scenarios where stemming and lemmatization break down, most notably misspellings. Misspellings are very common nowadays, especially in textual data from social media, and they make working with text considerably harder.
Consider a text corpus that contains misspelled words. For example, suppose the corpus contains two misspelled versions of the word ‘retrieving’ – ‘retreiving’ and ‘retreeving’.
If we stem these words, we get two different stems, ‘retreiv’ and ‘retreev’, so the problem of redundant tokens remains. Lemmatization, on the other hand, won’t help either: it returns the misspelled words unchanged, because it only works on words with correct dictionary spellings.
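The effect is easy to see with a minimal suffix-stripping stemmer (a toy sketch, not the full Porter algorithm — the suffix list and length check here are illustrative assumptions):

```python
def naive_stem(word):
    """Strip a few common English suffixes -- a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Two misspellings of 'retrieving' produce two different, redundant stems.
print(naive_stem("retreiving"))  # retreiv
print(naive_stem("retreeving"))  # retreev
```

Because the stemmer operates on surface characters only, the two misspelled variants land on two different stems instead of collapsing into one token.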
To handle misspellings in textual data, we first need to canonicalize the text by correcting the spellings; only then can stemming or lemmatization be applied effectively.
There are also situations where the same word is pronounced differently in different languages, and its spelling varies accordingly. Typical examples are names of people, cities, and dishes.
For example, the capital of India is New Delhi, but ‘Delhi’ is pronounced ‘Dilli’ in Hindi, so we may find both variants in a raw text corpus. Similarly, the surname ‘Srivastava’ has several spelling and pronunciation variants, such as ‘Shrivastava’ and ‘Srivastav’.
Performing stemming or lemmatization on these words will not help, as the issue of redundant tokens still remains. Therefore, we need to reduce all the variants of a particular word to a single common form, and for this we use the phonetic hashing technique.
Phonetic hashing buckets words that sound alike (words with similar pronunciations) together and assigns each bucket a single hash code. Hence, the words ‘Dilli’ and ‘Delhi’ will have the same hash code. In the next article, we will study in detail how phonetic hashing is performed.
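One classic phonetic hashing scheme is the American Soundex algorithm. A minimal sketch (assuming the standard Soundex letter-to-digit table) shows ‘Dilli’ and ‘Delhi’ landing in the same bucket:

```python
def soundex(word):
    """Hash a word to a 4-character Soundex code (letter + 3 digits)."""
    word = word.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    first = word[0]
    encoded = []
    prev = codes.get(first, "")       # first letter is kept, not encoded
    for ch in word[1:]:
        if ch in "HW":
            continue                  # H and W are ignored and do not break runs
        code = codes.get(ch)
        if code is None:              # vowels (and Y) break a run of equal codes
            prev = ""
            continue
        if code != prev:              # collapse adjacent identical codes
            encoded.append(code)
        prev = code
    return (first + "".join(encoded) + "000")[:4]

print(soundex("Delhi"), soundex("Dilli"))            # D400 D400
print(soundex("Srivastava"), soundex("Shrivastava")) # S612 S612
```

Both spellings of each name hash to the same code, so a lookup keyed on the Soundex code treats them as one token.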
Even after dealing with similar-sounding words, misspellings may still remain. As already discussed, misspellings need to be corrected before we can stem or lemmatize efficiently, and the problem is so common in social-media text that, left untreated, it makes working with text extremely difficult.
Edit distance is a method that can be used to correct these spelling errors and later serves as the basis for building a spell corrector. The edit distance between two strings is a non-negative integer: the minimum number of edits needed to turn one string into the other. We will learn the edit distance technique later in this course, followed by phonetic hashing.
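As a preview, the most common variant, the Levenshtein distance, counts insertions, deletions, and substitutions, and can be computed with a standard dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))    # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                    # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (or match)
        prev = curr
    return prev[-1]

# A misspelling sits a small edit distance away from the intended word.
print(levenshtein("retreiving", "retrieving"))  # 2 (the swapped 'e' and 'i')
```

A simple spell corrector can then suggest, for a misspelled token, the dictionary word with the smallest edit distance to it.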