NLP | Extracting Named Entities

Named entity recognition is a specific kind of chunk extraction that uses entity tags in addition to chunk tags.
Common entity tags include PERSON, LOCATION and ORGANIZATION. POS-tagged sentences are parsed into chunk trees as with normal chunking, but the tree labels can be entity tags instead of chunk phrase tags. NLTK ships a pre-trained named entity chunker, available through the ne_chunk() function in the nltk.chunk module. This function chunks a single tagged sentence into a Tree.

Code #1 : Using ne_chunk() on a tagged sentence of the treebank_chunk corpus

from nltk.corpus import treebank_chunk
from nltk.chunk import ne_chunk
  
ne_chunk(treebank_chunk.tagged_sents()[0])

Output :

Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), Tree('ORGANIZATION', 
[('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), 
('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'),
('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), 
('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])

Two entity tags are found: PERSON and ORGANIZATION. Each of these subtrees contains a list of the words that are recognized as a PERSON or an ORGANIZATION.
 
Code #2 : Function to extract named entities using the leaves of all matching subtrees

def sub_leaves(tree, label):
    # Collect the leaves of every subtree whose label matches `label`.
    return [t.leaves() 
            for t in tree.subtrees(
                    lambda s: s.label() == label)]

Code #3 : Using sub_leaves() to get all the PERSON or ORGANIZATION leaves from a tree

# sub_leaves() is the function from Code #2, assumed here to be saved in chunkers.py
from chunkers import sub_leaves

tree = ne_chunk(treebank_chunk.tagged_sents()[0])

print("Named entities of PERSON :",
      sub_leaves(tree, 'PERSON'))

print("\nNamed entities of ORGANIZATION :",
      sub_leaves(tree, 'ORGANIZATION'))

Output :

Named entities of PERSON : [[('Pierre', 'NNP')]]

Named entities of ORGANIZATION : [[('Vinken', 'NNP')]]
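
The leaves returned above are lists of (word, tag) pairs. When only the entity text is needed, the words can be joined and the tags dropped. The following is a minimal sketch of that step, not part of NLTK: leaves_to_text() is a hypothetical helper name, and the input lists are copied by hand from the output above so that the snippet runs without NLTK.

```python
# Leaves copied from the output above; each entity is a list of (word, tag) pairs.
person_leaves = [[('Pierre', 'NNP')]]
organization_leaves = [[('Vinken', 'NNP')]]


def leaves_to_text(leaves_list):
    """Join the words of each entity and drop the POS tags (hypothetical helper)."""
    return [' '.join(word for word, _tag in leaves) for leaves in leaves_list]


person_names = leaves_to_text(person_leaves)
organization_names = leaves_to_text(organization_leaves)

print("PERSON :", person_names)
print("ORGANIZATION :", organization_names)
```

An entity made of several words, such as [('New', 'NNP'), ('England', 'NNP')], would be joined into the single string 'New England'.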

To process multiple sentences at a time, ne_chunk_sents() is used. In the code below, the first 10 sentences from treebank_chunk.tagged_sents() are processed to get the ORGANIZATION sub_leaves().

Code #4 : Using ne_chunk_sents() on multiple sentences

from nltk.chunk import ne_chunk_sents
from nltk.corpus import treebank_chunk
  
# Chunk the first 10 tagged sentences, then collect the ORGANIZATION leaves of each tree
trees = ne_chunk_sents(treebank_chunk.tagged_sents()[:10])
[sub_leaves(t, 'ORGANIZATION') for t in trees]

Output :

[[[('Vinken', 'NNP')]], [[('Elsevier', 'NNP')]], [[('Consolidated', 'NNP'), 
('Gold', 'NNP'), ('Fields', 'NNP')]], [], [], [[('Inc.', 'NNP')], 
[('Micronite', 'NN')]], [[('New', 'NNP'), ('England', 'NNP'),
('Journal', 'NNP')]], [[('Lorillard', 'NNP')]], [], []]
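
The result above is one list per sentence, and sentences without an ORGANIZATION give an empty list. A flat list that records which sentence each entity came from is often easier to work with. The sketch below shows one way to build it. It is an illustration only: index_entities() is a hypothetical helper name, and the data is copied by hand from the output above so that the snippet runs without NLTK.

```python
# Per-sentence ORGANIZATION leaves, copied from the output above.
results = [
    [[('Vinken', 'NNP')]],
    [[('Elsevier', 'NNP')]],
    [[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]],
    [],
    [],
    [[('Inc.', 'NNP')], [('Micronite', 'NN')]],
    [[('New', 'NNP'), ('England', 'NNP'), ('Journal', 'NNP')]],
    [[('Lorillard', 'NNP')]],
    [],
    [],
]


def index_entities(per_sentence):
    """Return (sentence_index, entity_text) pairs (hypothetical helper)."""
    pairs = []
    for index, entities in enumerate(per_sentence):
        for leaves in entities:
            # Join the words of a multi-word entity and drop the POS tags.
            pairs.append((index, ' '.join(word for word, _tag in leaves)))
    return pairs


organization_index = index_entities(results)
for index, name in organization_index:
    print(index, name)
```

Sentences with no matches simply contribute no pairs, so the indices in the flat list can skip numbers.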

