Phonetic Hashing Technique with Soundex Algorithm in Python

As discussed in our previous article on Canonicalization in NLP, some words are pronounced differently in different languages, and as a result they are spelled differently. For example, the Indian surname ‘Agrawal’ has several spellings and pronunciations, such as ‘Agarwal’ and ‘Aggarwal’. Stemming or lemmatizing these words does not remove the redundant tokens: the variants are still treated as different words even though they refer to the same thing. Hence, we need to reduce all the variations of a word to a common form.

To achieve this, we use a technique called Phonetic Hashing. In Phonetic Hashing, we reduce all the variations of a word to a common code using a hashing technique.

Phonetic hashing is performed using the Soundex algorithm, and American Soundex is its most popular variant. It maps British and American spellings of a word to a common code. It works on the pronunciation of a word, i.e., words that sound similar are generally assigned the same hash code.

Now, let’s see the Soundex of the word ‘Bangalore’. To calculate the hash code, we’ll make changes to the same word, in-place, as follows:

  1. The phonetic hash is a four-character code. The first letter of the input word is retained as is, so the first character of the phonetic hash is ‘B’. Now, we need to make changes to the rest of the letters of the word.
  2. Next, we map every consonant after the first letter to a digit using the mapping below. Vowels are kept as they are for now, while ‘H’, ‘W’, and ‘Y’ are not encoded, i.e., they are removed from the code altogether. After this mapping, the code becomes ‘Ba52a4o6e’.
Soundex algorithm consonant mapping:

    B, F, P, V → 1
    C, G, J, K, Q, S, X, Z → 2
    D, T → 3
    L → 4
    M, N → 5
    R → 6

  3. In the next step, we remove all the vowels from the code. Here the vowels are the a’s, the ‘o’, and the ‘e’. After removing them, the code becomes ‘B5246’.

  4. The final step is to make the code exactly four characters long. If the resulting code is shorter than four characters, we pad it with zeroes on the right; if it is longer, we truncate it from the right. In our case the code has five characters, so we drop the last digit. The final code is ‘B524’.
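To make the four steps concrete, here is a minimal sketch that returns the intermediate string after each stage. The helper name `soundex_steps` and the constant `CONSONANT_CODES` are hypothetical names chosen for illustration. The sketch follows the steps literally, assumes purely alphabetic input, and does not collapse repeated digits.

```python
# Consonant groups from the mapping in step 2.
CONSONANT_CODES = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
                   "L": "4", "MN": "5", "R": "6"}


def soundex_steps(word):
    """Return (mapped, without_vowels, final) for the four steps (illustrative helper)."""
    token = word.upper()
    # Step 1: the first letter is kept as is
    mapped = token[0]
    # Step 2: map consonants to digits, drop H/W/Y, keep vowels for now
    for char in token[1:]:
        if char in "HWY":
            continue
        digit = next((code for letters, code in CONSONANT_CODES.items()
                      if char in letters), None)
        # vowels are written in lowercase so they are easy to spot
        mapped += digit if digit else char.lower()
    # Step 3: remove the vowels that follow the first letter
    without_vowels = mapped[0] + "".join(c for c in mapped[1:] if c not in "aeiou")
    # Step 4: truncate or zero-pad to exactly four characters
    final = without_vowels[:4].ljust(4, "0")
    return mapped, without_vowels, final


print(soundex_steps("Bangalore"))
# ('Ba52a4o6e', 'B5246', 'B524')
```

Each element of the returned tuple corresponds to the code at the end of steps 2, 3, and 4 respectively.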

Python Implementation of Soundex Algorithm

Since the process of computing a Soundex code is fixed, we can write a simple Python function that returns the Soundex code of any input word.

def get_soundex(word):
    """Get the Soundex code for a string."""
    token = word.upper()
    if not token:
        # nothing to encode for an empty string
        return ""

    # the first letter of the input is always the first character of the code
    soundex = token[0]
    # map letter groups to Soundex digits; vowels and 'H', 'W', 'Y'
    # are temporarily represented by '.'
    dictionary = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6", "AEIOUHWY": "."}

    for char in token[1:]:
        for key, code in dictionary.items():
            if char in key:
                # skip a code that repeats the previous one, so doubled
                # letters such as 'gg' are encoded only once
                if code != soundex[-1]:
                    soundex += code

    # remove the placeholders for vowels and 'H', 'W', 'Y'
    soundex = soundex.replace(".", "")
    # trim or pad with zeroes to make the code four characters long
    soundex = soundex[:4].ljust(4, "0")
    return soundex

Now, let’s see the Soundex of Bangalore and Bengaluru.
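The snippet below is a small usage sketch. So that it runs on its own, it restates `get_soundex` in a compact form that is intended to behave the same as the function above.

```python
def get_soundex(word):
    """Compact restatement of the article's get_soundex (same logic)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4",
              "MN": "5", "R": "6", "AEIOUHWY": "."}
    token = word.upper()
    soundex = token[0]
    for char in token[1:]:
        for letters, code in groups.items():
            # append the code unless it repeats the previous character
            if char in letters and code != soundex[-1]:
                soundex += code
    # drop placeholders, then trim or zero-pad to four characters
    return soundex.replace(".", "")[:4].ljust(4, "0")


print(get_soundex("Bangalore"))  # B524
print(get_soundex("Bengaluru"))  # B524
```

Both spellings of the city name map to the same code.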


Now, let’s check the Soundex of ‘Aggrawal’, ‘Agrawal’, ‘Aggarwal’, and ‘Agarwal’.
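As a rough sketch, we can group the four spellings by their Soundex code to confirm that they collapse to a single key. The grouping with `defaultdict` is our own illustration, and `get_soundex` is restated compactly so that the snippet is self-contained.

```python
from collections import defaultdict


def get_soundex(word):
    """Compact restatement of the article's get_soundex (same logic)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4",
              "MN": "5", "R": "6", "AEIOUHWY": "."}
    token = word.upper()
    soundex = token[0]
    for char in token[1:]:
        for letters, code in groups.items():
            # append the code unless it repeats the previous character
            if char in letters and code != soundex[-1]:
                soundex += code
    # drop placeholders, then trim or zero-pad to four characters
    return soundex.replace(".", "")[:4].ljust(4, "0")


names = ["Aggrawal", "Agrawal", "Aggarwal", "Agarwal"]
by_code = defaultdict(list)
for name in names:
    # spellings that share a code end up in the same bucket
    by_code[get_soundex(name)].append(name)

print(dict(by_code))
# {'A264': ['Aggrawal', 'Agrawal', 'Aggarwal', 'Agarwal']}
```

All four variants fall into one bucket, so any one of them can serve as the canonical form for the group.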


As we have seen, the similar-sounding spelling variations of a word are reduced to a common Soundex code, which solves the problem of handling different spellings that arise from different pronunciations.

In the next article, we will study the edit distance method for handling misspellings and use it to build a spell corrector in Python.

