Named entity recognition (NER) is one of the most common data preprocessing tasks. It involves identifying key information in a text and classifying it into a set of predefined categories. An entity is essentially the thing that is consistently talked about or referred to in the text.
NER is a form of natural language processing (NLP).
At its core, NER is a two-step process. The two steps are:
- Detecting the entities from the text
- Classifying them into different categories
The most important categories in NER are:
- Person
- Organization
- Place/location
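The two steps and the three core categories above can be illustrated with a toy sketch. This is not how spaCy or NLTK work internally: detection here is just a regular expression for runs of capitalized words, and classification is a lookup in a small hand-written table. The names `GAZETTEER`, `detect_candidates`, `classify` and `recognize` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical hand-written lookup table (an assumption for this sketch);
# real NER systems use trained statistical models instead.
GAZETTEER = {
    "George Washington": "PERSON",
    "Google": "ORG",
    "England": "LOC",
}

def detect_candidates(text):
    """Step 1: detect candidate entities as runs of capitalized words."""
    return [m.group() for m in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)]

def classify(span):
    """Step 2: assign a category to a detected span via table lookup."""
    return GAZETTEER.get(span, "UNKNOWN")

def recognize(text):
    """Run both steps and return (entity text, category) pairs."""
    return [(span, classify(span)) for span in detect_candidates(text)]

sample = "George Washington never visited England or worked at Google."
print(recognize(sample))
# [('George Washington', 'PERSON'), ('England', 'LOC'), ('Google', 'ORG')]
```

A lookup table like this cannot handle unseen names or context, which is why the frameworks used later rely on trained models.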
Other common tasks include classifying the following:
- Date/time expressions
- Numerical measurements (money, percent, weight, etc.)
- E-mail addresses
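Categories like these often follow a fixed surface pattern, so a rough approximation is possible with regular expressions alone. The sketch below is only an illustration under that assumption: the `PATTERNS` table and the `find_pattern_entities` function are hypothetical, and the patterns are deliberately simple (for example, dates are matched only in `YYYY-MM-DD` form).

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for this sketch);
# production systems need much more robust rules or a trained model.
PATTERNS = {
    "DATE": r"\d{4}-\d{2}-\d{2}",                # ISO-style dates only
    "PERCENT": r"\d+(?:\.\d+)?%",
    "MONEY": r"\$\d+(?:,\d{3})*(?:\.\d+)?",      # dollar amounts only
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
}

def find_pattern_entities(text):
    """Return (entity text, category) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by start offset in the text
    return [(entity, label) for _, entity, label in found]

sample = "On 2021-03-15 sales rose 12.5% to $1,200.50; contact sales@example.com."
for entity, label in find_pattern_entities(sample):
    print(entity, label)
```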
Ambiguity in NER
For a human reader, the category of an entity is usually intuitively clear, but for computers there is some ambiguity in classification. Let’s look at some ambiguous examples:
- England (Organization) won the 2019 World Cup vs. The 2019 World Cup happened in England (Location).
- Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).
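To make the ambiguity concrete, here is a minimal sketch of how surrounding words can decide the label for the England example. The cue word sets and the `classify_in_context` function are hypothetical and hand-picked for this one example; a trained NER model learns such contextual signals from annotated data rather than from fixed rules.

```python
# Hypothetical context cues chosen by hand for this illustration only.
LOCATION_CUES = {"in", "at", "from"}        # words that often precede a place
TEAM_CUES = {"won", "lost", "played"}       # verbs that often follow a team

def classify_in_context(sentence, entity):
    """Label a single-word entity using its neighbouring words.

    Assumes the entity appears as a whitespace-separated token in the
    sentence; raises ValueError otherwise.
    """
    words = sentence.split()
    idx = words.index(entity)
    prev_word = words[idx - 1].lower() if idx > 0 else ""
    next_word = words[idx + 1].lower() if idx + 1 < len(words) else ""
    if prev_word in LOCATION_CUES:
        return "Location"
    if next_word in TEAM_CUES:
        return "Organization"
    return "Unknown"

print(classify_in_context("England won the 2019 world cup", "England"))
# Organization
print(classify_in_context("The 2019 world cup happened in England", "England"))
# Location
```

The same word receives two different labels purely because of its neighbours. The Washington pair would need further cues, which shows how quickly hand-written rules stop scaling.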
Implementation:
In this implementation, we will perform named entity recognition using two different frameworks: spaCy and NLTK. This code can be run on Colab; however, for visualization purposes, I recommend a local environment. We can install both frameworks using pip install.
First, we perform named entity recognition using spaCy.
# commands to run before the code
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
# imports and load the spaCy English language model
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
# Load the text and process it
# I copied the text from the Python wiki
# (each fragment ends with a space so the concatenated words do not run together)
text = ("Python is an interpreted, high-level and general-purpose programming language. "
        "Python's design philosophy emphasizes code readability with "
        "its notable use of significant indentation. "
        "Its language constructs and object-oriented approach aim to "
        "help programmers write clear and "
        "logical code for small and large-scale projects.")
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)
# tokenization
for token in doc:
    print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use the displacy function on doc to visualize the entities
displacy.render(doc, style='ent', jupyter=True)