Tokenization in Natural Language Processing

In this lesson, we will study the concept of tokenization. In natural language processing (NLP), tokenization is the process of breaking down raw text into smaller chunks called tokens. These tokens are typically characters, words, or sentences.

Why do we need tokenization?

Machines can only understand numeric data, so to make them understand raw text we first need to break it down into words. These words are then usually encoded into a numeric format, either based on their frequency using a Bag of Words model or based on their relevance using TF-IDF, both of which we will study later in the course. So, in a nutshell, tokenization is an intermediate step in converting raw text into a machine-understandable format.
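
To make this concrete, here is a minimal sketch of splitting a string into word tokens and counting how often each one occurs; the sentence is an assumed example, and counts like these are what a frequency-based encoding such as Bag of Words builds on.

```python
# Minimal illustration: tokenize a string and count token frequencies.
# The sentence is an assumed example, not taken from the lesson.
from collections import Counter

text = "the cat sat on the mat"
tokens = text.split()   # naive whitespace tokenization
print(tokens)           # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(Counter(tokens))  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```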

In the NLP pipeline, tokenization is the very first data-processing step, and it enables further analysis of textual data by allowing useful features to be extracted from it.

Types of Tokenizers in NLP

There are different types of tokenizers, each suited to a different scenario. For example, if we are building a phishing email detector using NLP, we first need to tokenize the mail content into words using a word tokenizer. Similarly, if we want to analyze a paragraph sentence by sentence, we have to use a sentence tokenizer.

The NLTK library in Python supports the following types of tokenizers (their import locations are sketched just after the list):

  1. Word Tokenizer
  2. Sentence Tokenizer
  3. Tweet Tokenizer
  4. Regex Tokenizer
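
All four live in NLTK's tokenize module. The sketch below only shows where to import them from; it assumes NLTK is installed, and word_tokenize and sent_tokenize additionally require the Punkt models to be downloaded.

```python
# Import locations in NLTK (assumes `pip install nltk`; word_tokenize and
# sent_tokenize also need the Punkt models: import nltk; nltk.download('punkt')).
from nltk.tokenize import word_tokenize    # word tokenizer
from nltk.tokenize import sent_tokenize    # sentence tokenizer
from nltk.tokenize import TweetTokenizer   # tweet tokenizer
from nltk.tokenize import RegexpTokenizer  # regex tokenizer
```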

1. Word Tokenizer

In this type of tokenizer, we split the text into individual words. To achieve this in Python, we can use the split() method, which by default splits the text on whitespace. This method of word tokenization is also known as whitespace tokenization. In practice, however, it doesn’t always give good results, because it fails to split contractions such as “can’t”, “hasn’t”, and “wouldn’t”. These issues are resolved by NLTK’s word tokenizer: it handles contractions well, and it also correctly handles words such as “o’clock”, which is not a contraction, as the comparison below shows.

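The following is a minimal sketch of this comparison. The example sentence is an assumption, chosen so that it contains a contraction (“I’ll”), the word “o’clock”, and words that end a sentence (“home”, “again”).

```python
# Whitespace tokenization vs. NLTK's word tokenizer.
# Assumes NLTK is installed and the Punkt models are available:
#   import nltk; nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "At 5 o'clock I'll be going home. Hope to see you again."

# split() breaks only on whitespace, so "I'll" stays whole and the
# periods remain attached to 'home.' and 'again.'
print(text.split())
# ['At', '5', "o'clock", "I'll", 'be', 'going', 'home.',
#  'Hope', 'to', 'see', 'you', 'again.']

# word_tokenize() splits the contraction "I'll" into 'I' and "'ll",
# separates the periods into their own tokens, and keeps "o'clock" intact.
print(word_tokenize(text))
# ['At', '5', "o'clock", 'I', "'ll", 'be', 'going', 'home', '.',
#  'Hope', 'to', 'see', 'you', 'again', '.']
```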

As we can see from the code above, the whitespace tokenizer is unable to split the contraction “I’ll” and also leaves the “.” attached to the words “home” and “again”. NLTK’s word tokenizer, on the other hand, not only splits on whitespace but also breaks contractions such as “I’ll” into “I” and “‘ll”, while keeping “o’clock” as a single, unbroken token.

2. Sentence Tokenizer

Tokenizing a text into sentences requires splitting it on sentence boundaries, most commonly marked by a period (‘.’). Let’s have a look at NLTK’s sentence tokenizer in the code below.

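Below is a minimal sketch of NLTK’s sent_tokenize; the text is an assumed example. Note that the pre-trained Punkt model behind sent_tokenize does more than split on every period: it generally avoids breaking on common abbreviations such as “Dr.”.

```python
# Sentence tokenization with NLTK's sent_tokenize.
# Assumes the Punkt models are available: import nltk; nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "I'll be going home at 5 o'clock. Hope to see you again. Dr. Smith may join us."

# sent_tokenize splits on sentence boundaries rather than on every '.',
# so the period after 'Dr' does not start a new sentence.
print(sent_tokenize(text))
# ["I'll be going home at 5 o'clock.",
#  'Hope to see you again.',
#  'Dr. Smith may join us.']
```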