NLP | Location Tags Extraction

  • Last Updated : 26 Feb, 2019

A custom ChunkParserI subclass can be used to identify LOCATION chunks by looking up words in the gazetteers corpus. The gazetteers corpus is a WordListCorpusReader that contains the following location words:

  • Country names
  • U.S. states and abbreviations
  • Mexican states
  • Major U.S. cities
  • Canadian provinces

The LocationChunker class iterates over a tagged sentence, looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced by the iob_locations() method, and the parse() method converts the IOB tags to a Tree.
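
To make the IOB scheme concrete, here is a minimal sketch in plain Python of how B-LOCATION and I-LOCATION tags delimit a chunk. The helper name group_iob_chunks is hypothetical and is not part of NLTK; the real conversion to a Tree is done by conlltags2tree.

```python
def group_iob_chunks(iob_triples, label='LOCATION'):
    """Collect runs of B-/I- tagged words into lists of (word, tag) pairs.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    chunks, current = [], None
    for word, tag, iob in iob_triples:
        if iob == 'B-' + label:
            # B- always opens a new chunk.
            current = [(word, tag)]
            chunks.append(current)
        elif iob == 'I-' + label and current is not None:
            # I- extends the chunk that is currently open.
            current.append((word, tag))
        else:
            # O (or any other tag) closes the open chunk.
            current = None
    return chunks


triples = [
    ('San', 'NNP', 'B-LOCATION'),
    ('Francisco', 'NNP', 'I-LOCATION'),
    ('is', 'BE', 'O'),
    ('cold', 'JJ', 'O'),
]
chunks = group_iob_chunks(triples)
print(chunks)
```

Here the two tagged words are grouped into one LOCATION chunk, while the words tagged O stay outside any chunk.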

Code #1 : LocationChunker class

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        # A set gives fast membership tests for location phrases.
        self.locations = set(gazetteers.words())
        self.lookahead = 0
        # lookahead is the largest number of spaces in any location name.
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords
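
The lookahead calculation can be tried without downloading the corpus. The sketch below repeats the same loop on a small made-up set of location names; the function name max_extra_words and the toy set are assumptions for illustration, not part of the gazetteers corpus API.

```python
def max_extra_words(locations):
    """Return the largest number of spaces found in any location name."""
    lookahead = 0
    for loc in locations:
        nwords = loc.count(' ')  # 'New York City' contains 2 spaces
        if nwords > lookahead:
            lookahead = nwords
    return lookahead


# Toy stand-in for set(gazetteers.words()).
toy_locations = {'CA', 'San Francisco', 'New York City'}
lookahead = max_extra_words(toy_locations)
print(lookahead)
```

The value tells the chunker how many following words it may need to examine when testing a multi-word location.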

Code #2 : iob_locations() method

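    # These methods belong inside the LocationChunker class from Code #1.
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False  # True while the previous word was part of a location

        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False
            # Grow the candidate phrase one word at a time.
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    # Continue an open chunk, or begin a new one.
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'
                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'
                    loc, inside = True, True
                    i = j
                    break
                if j < l:
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break
            # No location starts at this word, so tag it as outside.
            if not loc:
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        # Convert the (word, tag, IOB) triples into an nltk Tree.
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)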
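
For experimenting without NLTK or the gazetteers data, the following is a simplified, self-contained sketch of the same idea. It is not the article's exact loop: it tries the longest candidate phrase first and shrinks it, which is a swapped-in matching strategy. The function name iob_tag_locations and the toy location set are assumptions.

```python
def iob_tag_locations(tagged_sent, locations):
    """Yield (word, tag, iob) triples, trying the longest phrase first."""
    max_words = max((loc.count(' ') + 1 for loc in locations), default=1)
    n = len(tagged_sent)
    i, inside = 0, False
    while i < n:
        match_len = 0
        # Try the longest candidate phrase first, then shrink it.
        for size in range(min(max_words, n - i), 0, -1):
            phrase = ' '.join(word for word, _ in tagged_sent[i:i + size])
            if phrase in locations:
                match_len = size
                break
        if match_len:
            for offset in range(match_len):
                word, tag = tagged_sent[i + offset]
                # Only the first word of a fresh chunk gets the B- prefix.
                prefix = 'I' if (inside or offset > 0) else 'B'
                yield word, tag, prefix + '-LOCATION'
            inside = True
            i += match_len
        else:
            word, tag = tagged_sent[i]
            yield word, tag, 'O'
            inside = False
            i += 1


# Toy stand-in for the gazetteers word list.
toy_locations = {'San Francisco', 'San Jose', 'CA'}
sent = [('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP'),
        ('is', 'BE'), ('cold', 'JJ')]
triples = list(iob_tag_locations(sent, toy_locations))
print(triples)
```

As in the LocationChunker class, adjacent location words such as 'San Francisco' and 'CA' end up in a single chunk because the inside flag stays set between them.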

Code #3 : use the LocationChunker class to parse the sentence

# chunkers is the local module that holds LocationChunker and the
# sub_leaves helper, which returns the leaves of each subtree with a given label.
from chunkers import sub_leaves
from chunkers import LocationChunker

loc = LocationChunker()
t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])
print("Location : \n", sub_leaves(t, 'LOCATION'))

Output :

Location : 
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], 
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
