NLP | Location Tags Extraction
Last Updated : 26 Feb, 2019
A different kind of ChunkParserI subclass can be used to identify LOCATION chunks. It uses the gazetteers corpus to identify location words. The gazetteers corpus is a WordListCorpusReader that contains the following kinds of location words:
- Country names
- U.S. states and abbreviations
- Mexican states
- Major U.S. cities
- Canadian provinces
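Because the corpus is a plain word list, identifying a location comes down to an exact membership test, and some entries span more than one word. The sketch below illustrates this with a small hypothetical set standing in for the words returned by gazetteers.words(); the entries and the is_location helper are assumptions made for illustration, not the real corpus contents or an NLTK function.

```python
# Hypothetical stand-in for set(gazetteers.words()); not the real corpus contents.
sample_locations = {"Canada", "CA", "Texas", "San Francisco", "British Columbia"}


def is_location(phrase, locations=sample_locations):
    """Return True if the exact phrase appears in the word list."""
    return phrase in locations


# A multi-word entry matches only as a whole phrase, not word by word.
print(is_location("San Francisco"))
print(is_location("San"))
```

This is why a chunker built on the word list has to test phrases of several words, not just single tokens.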
The LocationChunker class iterates over a tagged sentence, looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced by the iob_locations() method, and the parse() method converts the IOB tags to a Tree.
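To make the IOB scheme concrete, here is a minimal pure-Python sketch of how (word, tag, IOB) triples group into chunks: B-LOCATION begins a chunk, I-LOCATION continues it, and O is outside any chunk. The group_iob_chunks function is a hypothetical helper written for this illustration; it is a simplified stand-in for the idea behind conlltags2tree, not NLTK's implementation, and it returns plain tuples instead of a Tree.

```python
def group_iob_chunks(iob_triples):
    """Group (word, tag, iob) triples into (label, [(word, tag), ...]) chunks."""
    chunks = []
    current = None  # the chunk currently being extended, if any
    for word, tag, iob in iob_triples:
        if iob == "O":
            # Outside any chunk: close whatever was open.
            current = None
        elif iob.startswith("B-") or current is None or current[0] != iob[2:]:
            # A B- tag (or an I- tag with nothing to continue) starts a new chunk.
            current = (iob[2:], [(word, tag)])
            chunks.append(current)
        else:
            # An I- tag with a matching label extends the open chunk.
            current[1].append((word, tag))
    return chunks


triples = [
    ("San", "NNP", "B-LOCATION"),
    ("Francisco", "NNP", "I-LOCATION"),
    ("is", "BE", "O"),
    ("cold", "JJ", "O"),
]
print(group_iob_chunks(triples))
# [('LOCATION', [('San', 'NNP'), ('Francisco', 'NNP')])]
```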
Code #1 : LocationChunker class
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        # All gazetteer entries, stored as a set for fast membership tests
        self.locations = set(gazetteers.words())
        # Largest number of spaces in any entry; used as the lookahead
        # window when matching multi-word location names
        self.lookahead = 0
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords
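The loop in the constructor finds the largest number of spaces in any gazetteer entry, which later bounds how far the chunker looks ahead. The same computation can be sketched on a toy word list; the sample_locations set and the max_lookahead helper below are hypothetical and exist only for this illustration.

```python
def max_lookahead(locations):
    """Return the largest number of spaces found in any location phrase."""
    return max((loc.count(" ") for loc in locations), default=0)


# Hypothetical sample entries; the real values come from gazetteers.words().
sample_locations = {"Texas", "San Francisco", "Salt Lake City"}

# "Salt Lake City" contains two spaces, so the lookahead is 2.
print(max_lookahead(sample_locations))
```

Using max() with default=0 mirrors the class's starting value of 0 when the word list is empty.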
Code #2 : iob_locations() method
    # Methods of the LocationChunker class
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False  # True while the previous word was part of a location
        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False
            # Try progressively longer phrases starting at word i
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'
                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'
                    loc, inside = True, True
                    # Resume scanning after the matched phrase
                    i = j
                    break
                if j < l:
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break
            if not loc:
                # No location starts at word i: tag it as outside any chunk
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)
Code #3 : use the LocationChunker class to parse the sentence
# chunkers is the local module that holds LocationChunker and the sub_leaves helper
from chunkers import sub_leaves
from chunkers import LocationChunker

loc = LocationChunker()
t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])
print("Location : \n", sub_leaves(t, 'LOCATION'))
Output :
Location :
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')],
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]