Table of Contents
This article is part ongoing free NLP course. In the previous lesson, we studied Hidden Markov Model & its implementation in Python. In this lesson, we will explain in detail what is named entity recognition, the types of named entities, how named entity recognition works, IOB labeling in named entity recognition, types of named entity recognition techniques, applications of named entity recognition, what are the open-source industrial tools available for named entity recognition and finally will implement named entity recognition for a sample sentence using spaCy in Python.
Over the last couple of years, Named Entity Recognition (NER) has become an increasingly important task in natural language processing. With the growth of unstructured data on the internet, it has become essential to extract meaningful information from text data. Named Entity Recognition plays a vital role in this process by identifying and categorizing named entities in text.
One of the primary reasons for the growing need for NER is the explosive growth of data on the internet. The amount of unstructured data available on the web is staggering, and traditional methods of data analysis are insufficient to handle this volume of information. NER enables machines to understand the context in which named entities appear, making it easier to analyze and interpret large volumes of text data.
Another reason for the growing need for NER is the increasing use of natural language processing in various industries. NER plays a crucial role in tasks such as sentiment analysis, text classification, and machine translation. With the rise of chatbots and virtual assistants, NER has become even more important as machines need to understand and interpret user queries accurately.
1. What is Named Entity Recognition?
Named Entity Recognition (NER) is a subfield of Natural Language Processing (NLP) that involves identifying and classifying named entities in unstructured text data. Named entities refer to real-world entities present in the text. Some common examples of named entities are persons, organizations, locations, dates, and time expressions.
2. Types of Named Entities
Named entities can be broadly classified into the following types:
- Person: This category includes names of people, titles, and pronouns referring to people.
- Location: This category includes the names of countries, cities, streets, and other geographic locations.
- Organization: This category includes the names of companies, institutions, and government organizations.
- Time: This category includes dates, times, and durations.
- Money: This category includes monetary values such as currency, prices, and percentages.
- Miscellaneous: This category includes anything that doesn’t fall under the above categories or if we have any custom entities based on a specific domain.
3. How does Named Entity Recognition (NER) work?
NER is usually performed using machine learning or deep learning algorithms that are trained on annotated text data. Annotated text data is text data that has been labeled with named entity categories. Machine learning algorithms use these annotated datasets to learn patterns and identify named entities in raw text data.
The NER process typically involves three main steps:
1. Tokenization: In the tokenization step, the text data is split into individual words or tokens.
2. Part-of-speech tagging: In the part-of-speech tagging step, each token is labeled with its part of speech (noun, verb, adjective, etc.).
3. Entity classification: Finally, in the entity classification step, the named entities are identified and classified into predefined categories.
4. IOB Labelling in Named Entity Recognition
IOB labeling (short for Inside, Outside, Beginning labeling) is a popular technique used in Named Entity Recognition (NER) to annotate and categorize text data.
IOB labeling involves labeling each word or token in a text sequence with a specific tag that indicates whether it is part of a named entity or not.
In IOB labeling, each token in a sequence is assigned one of three tags:
- B (Beginning): the token marks the beginning of a named entity.
- I (Inside): the token is part of a named entity, but it is not the first token.
- O (Outside): the token is not part of a named entity.
Here’s an example of IOB labeling for a sentence:
Original sentence: “Bill works for Google in New York.”
IOB labeled sentence: “Bill works for B-ORG Google in B-LOC New York.“
In this example, “Bill” is not part of a named entity, so it is tagged with “O”. “Google” and “New York” are both named entities, so they are tagged with “B-ORG” (beginning of an organization) and “B-LOC” (beginning of a location), respectively. The rest of the tokens in the sentence are marked as “O”.
By using IOB labeling, NER models can accurately identify and extract named entities from unstructured text data, which is critical for many NLP applications such as information retrieval, text classification, and sentiment analysis.
5. Types of Named Entity Recognition
There are broadly two types of named entity recognition techniques:
1. Rule-based NER
Rule-based NER involves creating a set of rules or patterns that are used to identify named entities. These rules are based on things like the presence of certain words, the context of the words, and other linguistic features of the text. Rule-based systems are often used when there is a well-defined set of named entities that need to be extracted, and the text data is relatively structured.
Here’s an example of a rule-based NER system for identifying names in text:
Rule 1: If a word starts with a capital letter and is followed by one or more lowercase letters, it is likely a name.
Rule 2: If two or more words start with a capital letter and are followed by one or more lowercase letters, they are likely multi-word names.
Using these rules, we can identify names in text like “John Smith” and “Emily Johnson”.
2. Probabailistic NER
Probabilistic NER, on the other hand, uses machine learning algorithms to identify named entities. These algorithms are trained on large datasets of annotated text data, where each word is labeled with the appropriate named entity category. The algorithms use this labeled data to learn patterns and make predictions about which words are likely to be named entities.
Here’s an example of a probabilistic NER system for identifying organizations in text:
Training data: A large dataset of annotated text data where each word is labeled with the appropriate named entity category. For example, the word “Google” might be labeled as an organization.
Algorithm: Conditional Random Fields (CRF) is a popular algorithm used for probabilistic NER. CRF is a type of machine learning algorithm that can take into account the context of the text when making predictions.
Using this system, we can identify organizations in text like “Google” and “Microsoft”.
6. Applications of Named Entity Recognition
Named Entity Recognition has numerous applications in various fields. Some of the most common applications of NER include:
- Information Retrieval: NER can be used to extract structured information from unstructured text data, which can then be used for information retrieval.
- Information Extraction: NER can be used to extract important information from text data, such as events, dates, and times.
- Machine Learning: NER can be used as a feature for machine learning algorithms to improve the accuracy of text classification tasks.
- Sentiment Analysis: NER can be used to identify and extract named entities related to specific sentiments, which can then be used to perform sentiment analysis.
7. Open-source libraries/tools for NER
There are 4 best open-source tools available for named entity recognition.
- SNER (Stanford Named Entity Recognizer) is a tool developed by Stanford University, which is based on the Conditional Random Fields (CRF) algorithm and provides pre-trained models for entity extraction. It is written in JAVA and offers a standard library for developers to use.
- SpaCy is a Python-based framework that is widely used for creating statistical systems, particularly custom Named Entity Recognition (NER) extractors. It is known for its speed and ease of use, making it a popular choice among developers.
- NLTK (Natural Language Toolkit) is another popular library for Python developers working on Natural Language Processing (NLP) tasks. It comes with its own classifiers, including the ne_chunk function. NLTK can also be used with the Stanford NER Tagger for Python, providing a wide range of NER options for developers to choose from.
- Flair is a lightweight and user-friendly framework designed specifically for Natural Language Processing (NLP) tasks. It is built on top of the powerful PyTorch deep learning framework and is capable of supporting over 250 different languages. Flair is particularly useful for training small models, thanks to its simplicity and ease of use.
NER Implementation in Python
In this demo, we will demonstrate named entity recognition using spaCy in Python.
For using spaCy we first need to install spacy using the below commands.
pip install -U pip setuptools wheel pip install -U spacy python -m spacy download en_core_web_sm
Next, for getting named entities recognition on sample sentences, we will be using a pre-trained spaCy model i.e., en_core_web_sm
Sentence to analyze:
John was born and raised in New York City, one of the most bustling and diverse cities in the world. Growing up, he always had a fascination with different cultures and languages, which led him to pursue a degree in international studies. After graduation, John decided to travel the world and immerse himself in different cultures. He spent several months living in Paris, where he fell in love with the language, architecture, and cuisine. During his stay, John made many new friends, including Marie, a local artist who showed him around the city and introduced him to her favorite spots, such as the Louvre Museum and the Champs-Élysées. John also enjoyed exploring the city on his own, taking long walks along the Seine River and admiring the Eiffel Tower at night. After his time in Paris, John continued his travels, visiting other locations such as Tokyo, Sydney, and Rio de Janeiro. But Paris remained his favorite city, and he vowed to return someday.
Python Code
import spacy from spacy import displacy nlp = spacy.load('en_core_web_sm') raw_text = "John was born and raised in New York City, one of the most bustling and diverse cities in the world. Growing up, he always had a fascination with different cultures and languages, which led him to pursue a degree in international studies. After graduation, John decided to travel the world and immerse himself in different cultures. He spent several months living in Paris, where he fell in love with the language, architecture, and cuisine. During his stay, John made many new friends, including Marie, a local artist who showed him around the city and introduced him to her favorite spots, such as the Louvre Museum and the Champs-Élysées. John also enjoyed exploring the city on his own, taking long walks along the Seine River and admiring the Eiffel Tower at night. After his time in Paris, John continued his travels, visiting other locations such as Tokyo, Sydney, and Rio de Janeiro. But Paris remained his favorite city, and he vowed to return someday." doc = nlp(raw_text) displacy.render(doc, style="ent")
As we can observe from the above output, it has detected PERSON, GPE(Geo-Political Entity), ORG, DATE, TIME, and FAC( Buildings, airports, highways, bridges, etc. )
Conclusion
Named Entity Recognition is a powerful technique that can be used to extract important information from unstructured text data. By identifying and classifying named entities, NER can be used for various applications such as information retrieval, information extraction, machine learning, and sentiment analysis. With the help of machine learning algorithms, NER is becoming more accurate and effective, making it a valuable tool for anyone working with large amounts of text data.